Abstract This chapter deals with decentralized learning algorithms for in-network
processing of graph-valued data. A generic learning problem is formulated and recast
into a separable form, which is iteratively minimized using the alternating-direction
method of multipliers (ADMM) so as to gain the desired degree of parallelization.
Without exchanging elements from the distributed training sets, and while keeping
inter-node communications at affordable levels, the local (per-node) learners
consent to the desired quantity inferred globally, meaning the one obtained if the
entire training data set were centrally available. The impact of the decentralized
learning framework on contemporary wireless communications and networking tasks is
illustrated through case studies including target tracking using wireless sensor
networks, unveiling Internet traffic anomalies, power system state estimation, as well
as spectrum cartography for wireless cognitive radio networks.
1 Introduction

This chapter puts forth an optimization framework for learning over networks that
entails decentralized processing of training data acquired by interconnected nodes.
Such an approach is of paramount importance when communication of training data
to a central processing unit is prohibited due to, e.g., communication cost or privacy
reasons. The so-termed in-network processing paradigm for decentralized learning
is based on successive refinements of local model parameter estimates maintained at
individual network nodes. In a nutshell, each iteration of this broad class of fully
decentralized algorithms comprises: (i) a communication step where nodes exchange
information with their neighbors through, e.g., the shared wireless medium or the
Internet backbone; and (ii) an update step where each node uses this information to
refine its local estimate. Devoid of hierarchy and with their decentralized in-network
processing, the local learners (e.g., estimators) should eventually consent to the global
estimator sought, while fully exploiting existing spatiotemporal correlations to maximize
estimation performance. In most cases, consensus can formally be attained only
asymptotically in time. However, a finite number of iterations will suffice to obtain
results that are sufficiently accurate for all practical purposes.

In this context, the approach followed here entails reformulating a generic learning
task as a convex constrained optimization problem, whose structure lends itself
naturally to decentralized implementation over a network graph. It is then possible
to capitalize upon this favorable structure by resorting to the alternating-direction
method of multipliers (ADMM), an iterative optimization method that can be traced
back to [?] (see also [?]), and which is especially well-suited for parallel processing
[?]. This way, simple decentralized recursions become available to update each
node's local estimate, as well as a vector of dual prices through which network-wide
agreement is effected.

Problem statement. Consider a network of $n$ nodes in which scarcity of power
and bandwidth resources encourages only single-hop inter-node communications,
such that the $i$-th node communicates solely with nodes $j$ in its single-hop
neighborhood $\mathcal{N}_i$. Inter-node links are assumed symmetric, and the network is
modeled as an undirected graph whose vertices are the nodes and whose edges represent
the available communication links. As will become clear through the different
application domains studied here, nodes could be wireless sensors, wireless access
points (APs), electrical buses, sensing cognitive radios, or routers, to name a few
examples. Node $i$ acquires $m_i$ measurements stacked in the vector
$\mathbf{y}_i \in \mathbb{R}^{m_i \times 1}$ containing information about the unknown model
parameters in $\mathbf{s} \in \mathbb{R}^{p \times 1}$, which the nodes need to estimate.
Let $\mathbf{y} := [\mathbf{y}_1^\top, \ldots, \mathbf{y}_n^\top]^\top$ collect the measurements
acquired across the entire network. Many popular centralized schemes obtain an
estimate $\hat{\mathbf{s}}$ as follows

$$\hat{\mathbf{s}} \in \arg\min_{\mathbf{s}} \sum_{i=1}^{n} f_i(\mathbf{s};\mathbf{y}_i). \tag{1}$$

In the decentralized learning problem studied here though, the summands $f_i$ are
assumed to be local cost functions known only to each node $i$. Otherwise, sharing
this information with a centralized processor, also referred to as fusion center (FC),
can be challenging in various applications of interest, or it may even be impossible
in, e.g., wireless sensor networks (WSNs) operating under stringent power budget
constraints. In other cases, such as the Internet or collaborative healthcare studies,
agents may not be willing to share their private training data $\mathbf{y}_i$ but only the
learning results. Performing the optimization (1) in a centralized fashion raises
robustness concerns as well, since the central processor represents an isolated point
of failure.

In this context, the objective of this chapter is to develop a decentralized algorithmic
framework for learning tasks, based on in-network processing of the locally
available data. The described setup naturally suggests three characteristics that the
algorithms should exhibit: c1) each node $i = 1, \ldots, n$ should obtain an estimate of
$\mathbf{s}$ which coincides with the corresponding solution $\hat{\mathbf{s}}$ of the centralized
estimator (1) that uses the entire data $\{\mathbf{y}_i\}_{i=1}^{n}$; c2) processing per node
should be kept as simple as possible; and c3) the overhead for inter-node communications
should be affordable and confined to single-hop neighborhoods. It will be argued that
such an ADMM-based algorithmic framework can be useful for contemporary applications
in the domain of wireless communications and networking.


Prior art. Existing decentralized solvers of (1) can be classified in two categories:
C1) those obtained by modifying centralized algorithms and operating in the primal
domain; and C2) those handling an equivalent constrained form of (1) (see (2) in
Section 2), and operating in the primal-dual domain.

Primal-domain algorithms under C1 include the (sub)gradient method and its
variants [37], the incremental gradient method [?, 60], the proximal gradient
method [16], and the dual averaging method [24]. Each node in these methods
averages its local iterate with those of its neighbors and descends along its local
negative (sub)gradient direction. However, the resultant algorithms are limited to
inexact convergence when using constant stepsizes [57, ?]. If diminishing stepsizes
are employed instead, the algorithms can achieve exact convergence at the price of
slower convergence [24, 37, ?]. A constant-stepsize exact first-order algorithm is
also available to achieve fast and exact convergence, by correcting error terms in the
distributed gradient iteration with two-step historic information [?].

Primal-dual domain algorithms under C2 solve an equivalent constrained form
of (1), and thus drive local solutions to reach global optimality. The dual decomposition
method is hence applicable because (sub)gradients of the dual function
depend on local and neighboring iterates only, and can thus be computed without
global cooperation [?]. ADMM modifies the dual decomposition by regularizing
the constraints with a quadratic term, which improves numerical stability as well
as rate of convergence, as will be demonstrated later in this chapter. Per ADMM
iteration, each node solves a subproblem that can be demanding. Fortunately, these
subproblems can be solved inexactly by running one-step gradient or proximal gradient
descent iterations, which markedly mitigate the computation burden [?].
A sequential distributed ADMM algorithm can be found in [?].


Chapter outline. The remainder of this chapter is organized as follows. Section 2
describes a generic ADMM framework for decentralized learning over networks,
which is at the heart of all algorithms described in the chapter and was pioneered
in [67, ?] for in-network estimation using WSNs. Section 3 focuses on batch estimation
as well as (un)supervised inference, while Section 4 deals with decentralized
adaptive estimation and tracking schemes where network nodes collect data sequentially
in time. Internet traffic anomaly detection and spectrum cartography for wireless
CR networks serve as motivating applications for the sparsity-regularized rank
minimization algorithms developed in Section 5. Fundamental results on the convergence
and convergence rate of decentralized ADMM are stated in Section 6.


2 In-Network Learning with ADMM in a Nutshell 

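Since local summands in (1) are coupled through a global variable $\mathbf{s}$, it is not
straightforward to decompose the unconstrained optimization problem in (1). To
overcome this hurdle, the key idea is to introduce local variables
$\mathcal{S} := \{\mathbf{s}_i\}_{i=1}^{n}$, which represent local estimates of $\mathbf{s}$ per
network node $i$ [?]. Accordingly, one can formulate the constrained minimization problem

$$\{\hat{\mathbf{s}}_i\}_{i=1}^{n} \in \arg\min_{\mathcal{S}} \sum_{i=1}^{n} f_i(\mathbf{s}_i;\mathbf{y}_i), \quad \text{s. to } \mathbf{s}_i = \mathbf{s}_j,\ j \in \mathcal{N}_i. \tag{2}$$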

The “consensus” equality constraints in (2) ensure that local estimates coincide
within neighborhoods. Further, if the graph is connected then consensus naturally
extends to the whole network, and it turns out that problems (1) and (2) are equivalent
in the sense that $\hat{\mathbf{s}} = \hat{\mathbf{s}}_1 = \ldots = \hat{\mathbf{s}}_n$ [?]. Interestingly, the
formulation in (2) exhibits a separable structure that is amenable to decentralized
minimization. To leverage this favorable structure, the alternating direction method of
multipliers (ADMM), see e.g., [?, pp. 253-261], can be employed here to minimize (2) in a
decentralized fashion. This procedure will yield a distributed estimation algorithm
whereby local iterates $\mathbf{s}_i(k)$, with $k$ denoting iterations, provably converge to the
centralized estimate $\hat{\mathbf{s}}$ in (1); see also Section 6.

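To facilitate application of ADMM, consider the auxiliary variables
$\mathcal{Z} := \{\mathbf{z}_i^j\}_{j \in \mathcal{N}_i}$, and reparameterize the constraints in (2)
with the equivalent ones

$$\{\hat{\mathbf{s}}_i\}_{i=1}^{n} \in \arg\min_{\mathcal{S}} \sum_{i=1}^{n} f_i(\mathbf{s}_i;\mathbf{y}_i), \quad \text{s. to } \mathbf{s}_i = \mathbf{z}_i^j \text{ and } \mathbf{s}_j = \mathbf{z}_i^j,\ i = 1,\ldots,n,\ j \in \mathcal{N}_i,\ i \neq j. \tag{3}$$

Variables $\mathbf{z}_i^j$ are only used to derive the local recursions but will be eventually
eliminated. Attaching Lagrange multipliers
$\mathcal{V} := \{\{\bar{\mathbf{v}}_i^j\}_{j \in \mathcal{N}_i}, \{\tilde{\mathbf{v}}_i^j\}_{j \in \mathcal{N}_i}\}_{i=1}^{n}$
to the constraints (3), consider the augmented Lagrangian function

$$L_c[\mathcal{S},\mathcal{Z},\mathcal{V}] = \sum_{i=1}^{n} f_i(\mathbf{s}_i;\mathbf{y}_i) + \sum_{i=1}^{n}\sum_{j \in \mathcal{N}_i}\left[(\bar{\mathbf{v}}_i^j)^\top(\mathbf{s}_i - \mathbf{z}_i^j) + (\tilde{\mathbf{v}}_i^j)^\top(\mathbf{s}_j - \mathbf{z}_i^j)\right] + \frac{c}{2}\sum_{i=1}^{n}\sum_{j \in \mathcal{N}_i}\left[\|\mathbf{s}_i - \mathbf{z}_i^j\|^2 + \|\mathbf{s}_j - \mathbf{z}_i^j\|^2\right] \tag{4}$$

where the constant $c > 0$ is a penalty coefficient. To minimize (3), ADMM entails
an iterative procedure comprising three steps per iteration $k = 1, 2, \ldots$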

[S1] Multiplier updates:

$$\bar{\mathbf{v}}_i^j(k) = \bar{\mathbf{v}}_i^j(k-1) + c\,[\mathbf{s}_i(k) - \mathbf{z}_i^j(k)]$$
$$\tilde{\mathbf{v}}_i^j(k) = \tilde{\mathbf{v}}_i^j(k-1) + c\,[\mathbf{s}_j(k) - \mathbf{z}_i^j(k)].$$

[S2] Local estimate updates:

$$\mathcal{S}(k+1) = \arg\min_{\mathcal{S}} L_c[\mathcal{S}, \mathcal{Z}(k), \mathcal{V}(k)].$$

[S3] Auxiliary variable updates:

$$\mathcal{Z}(k+1) = \arg\min_{\mathcal{Z}} L_c[\mathcal{S}(k+1), \mathcal{Z}, \mathcal{V}(k)]$$

where $i = 1, \ldots, n$ and $j \in \mathcal{N}_i$ in [S1]. Reformulating the generic learning
problem (1) as (3) renders the augmented Lagrangian in (4) highly decomposable. The
separability comes in two flavors, both with respect to the sets $\mathcal{S}$ and $\mathcal{Z}$ of
primal variables, as well as across nodes $i = 1, \ldots, n$. This in turn leads to highly
parallelized, simplified recursions corresponding to the aforementioned steps [S1]-[S3].
Specifically, as detailed in e.g., [29, 48, 51, 68], it follows that if the multipliers
are initialized to zero, the ADMM-based decentralized algorithm reduces to the
following updates carried out locally at every node


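In-network learning algorithm at node $i$, for $k = 1, 2, \ldots$:

$$\mathbf{v}_i(k) = \mathbf{v}_i(k-1) + c\sum_{j \in \mathcal{N}_i}[\mathbf{s}_i(k) - \mathbf{s}_j(k)] \tag{5}$$

$$\mathbf{s}_i(k+1) = \arg\min_{\mathbf{s}_i}\left\{ f_i(\mathbf{s}_i;\mathbf{y}_i) + \mathbf{v}_i^\top(k)\,\mathbf{s}_i + c\sum_{j \in \mathcal{N}_i}\left\|\mathbf{s}_i - \frac{\mathbf{s}_i(k) + \mathbf{s}_j(k)}{2}\right\|^2 \right\} \tag{6}$$

where $\mathbf{v}_i(k) := 2\sum_{j \in \mathcal{N}_i}\bar{\mathbf{v}}_i^j(k)$, and all initial values are set to zero.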

Recursions (5) and (6) entail local updates, which comprise the general purpose
ADMM-based decentralized learning algorithm. The inherently redundant set of
auxiliary variables in $\mathcal{Z}$ and corresponding multipliers have been eliminated. Each
node, say the $i$-th one, does not need to separately keep track of all its non-redundant
multipliers $\{\bar{\mathbf{v}}_i^j(k)\}_{j \in \mathcal{N}_i}$, but only to update the (scaled) sum
$\mathbf{v}_i(k)$. In the end, node $i$ has to store and update only two $p$-dimensional vectors,
namely $\mathbf{s}_i(k)$ and $\mathbf{v}_i(k)$. A unique feature of in-network processing is that nodes
communicate their updated local estimates $\{\mathbf{s}_i\}$ (and not their raw data $\mathbf{y}_i$)
with their neighbors, in order to carry out the tasks (5)-(6) for the next iteration.

As elaborated in Section 6, under mild assumptions on the local costs one can
establish that $\lim_{k \to \infty}\mathbf{s}_i(k) = \hat{\mathbf{s}}$, for $i = 1, \ldots, n$. As a result, the
algorithm asymptotically attains consensus and the performance of the centralized
estimator [cf. (1)].
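
A short worked argument may help to see why a fixed point of (5)-(6) matches the centralized solution of (1). It is only a sketch: it assumes differentiable convex local costs and a connected graph with symmetric links, and the starred symbols $\mathbf{s}_i^\star$, $\mathbf{v}_i^\star$ are hypothetical fixed-point values introduced here for illustration. It says nothing about whether or how fast the iterates reach such a point.

```latex
% Fixed point of (5): the multiplier stops changing, so on a connected graph
\sum_{j\in\mathcal{N}_i}\bigl(\mathbf{s}_i^\star-\mathbf{s}_j^\star\bigr)=\mathbf{0}\quad\forall i
\;\Longrightarrow\;
\mathbf{s}_1^\star=\dots=\mathbf{s}_n^\star=:\mathbf{s}^\star .

% Optimality condition of (6) at consensus: the quadratic penalty term has
% zero gradient because (s_i(k)+s_j(k))/2 = s^\star
\nabla f_i(\mathbf{s}^\star;\mathbf{y}_i)+\mathbf{v}_i^\star=\mathbf{0},\qquad i=1,\dots,n .

% Summing (5) over all nodes: with symmetric links the increments cancel, and
% zero initialization gives
\sum_{i=1}^{n}\mathbf{v}_i(k)
=\sum_{i=1}^{n}\mathbf{v}_i(k-1)
+c\sum_{i=1}^{n}\sum_{j\in\mathcal{N}_i}\bigl[\mathbf{s}_i(k)-\mathbf{s}_j(k)\bigr]
=\mathbf{0}\quad\forall k .

% Adding the n optimality conditions therefore yields the stationarity
% condition of the centralized problem (1)
\sum_{i=1}^{n}\nabla f_i(\mathbf{s}^\star;\mathbf{y}_i)=\mathbf{0}.
```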


3 Batch In-Network Estimation and Inference 


3.1 Decentralized Signal Parameter Estimation 


Many workhorse estimation schemes such as maximum likelihood estimation (MLE),
least-squares estimation (LSE), best linear unbiased estimation (BLUE), as well as
linear minimum mean-square error estimation (LMMSE) and maximum a posteriori
(MAP) estimation, can all be formulated as a minimization task similar to
(1); see e.g. [?]. However, the corresponding centralized estimation algorithms fall
short in settings where both the acquired measurements and computational capabilities
are distributed among multiple spatially scattered sensing nodes, which is the
case with WSNs. Here we outline a novel batch decentralized optimization framework
building on the ideas in Section 2, that formulates the desired estimator as
the solution of a separable constrained convex minimization problem tackled via
ADMM; see e.g., [?, 68, 70] for further details on the algorithms outlined here.

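Depending on the estimation technique utilized, the local cost functions $f_i(\cdot)$
in (1) should be chosen accordingly, see e.g., [?, ?, 70]. For instance, when $\mathbf{s}$ is
assumed to be an unknown deterministic vector, then:

• If $\hat{\mathbf{s}}$ corresponds to the centralized MLE then
  $f_i(\mathbf{s};\mathbf{y}_i) = -\ln[p_i(\mathbf{y}_i;\mathbf{s})]$ is the negative log-likelihood capturing
  the data probability density function (pdf), while the network-wide data
  $\{\mathbf{y}_i\}_{i=1}^{n}$ are assumed statistically independent.

• If $\hat{\mathbf{s}}$ corresponds to the BLUE (or weighted least-squares estimator) then
  $f_i(\mathbf{s};\mathbf{y}_i) = (1/2)\|\boldsymbol{\Sigma}_{y_i}^{-1/2}(\mathbf{y}_i - \mathbf{H}_i\mathbf{s})\|^2$, where
  $\boldsymbol{\Sigma}_{y_i}$ denotes the covariance of the data $\mathbf{y}_i$, and $\mathbf{H}_i$ is a
  known fitting matrix.

When $\mathbf{s}$ is treated as a random vector, then:

• If $\hat{\mathbf{s}}$ corresponds to the centralized MAP estimator then
  $f_i(\mathbf{s};\mathbf{y}_i) = -(\ln[p_i(\mathbf{y}_i|\mathbf{s})] + n^{-1}\ln[p(\mathbf{s})])$, where
  $p_i(\mathbf{y}_i|\mathbf{s})$ accounts for the data pdf and $p(\mathbf{s})$ for the prior pdf of
  $\mathbf{s}$ (the factor $n^{-1}$ makes the prior count exactly once in the sum over
  nodes), while data $\{\mathbf{y}_i\}_{i=1}^{n}$ are assumed conditionally independent
  given $\mathbf{s}$.

• If $\hat{\mathbf{s}}$ corresponds to the centralized LMMSE then
  $f_i(\mathbf{s};\mathbf{y}_i) = (1/2)\|\mathbf{s} - n\boldsymbol{\Sigma}_{sy_i}\mathbf{u}_i\|^2$, where
  $\boldsymbol{\Sigma}_{sy_i}$ denotes the cross-covariance of $\mathbf{s}$ with $\mathbf{y}_i$, while
  $\mathbf{u}_i$ stands for the $i$-th $m_i \times 1$ block subvector of
  $\mathbf{u} = \boldsymbol{\Sigma}_{y}^{-1}\mathbf{y}$.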

Substituting in (6) the specific $f_i(\mathbf{s};\mathbf{y}_i)$ for each of the aforementioned
estimation tasks yields a family of batch ADMM-based decentralized estimation algorithms.
The decentralized BLUE algorithm will be described in this section as an example
of decentralized linear estimation.
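
As a quick consistency check on the LMMSE cost listed above, the following worked step verifies that the sum of local costs is minimized by the centralized LMMSE estimate. Two assumptions are made here for the illustration: $\mathbf{s}$ and $\mathbf{y}$ are zero mean, so that the LMMSE estimator takes the form $\boldsymbol{\Sigma}_{sy}\boldsymbol{\Sigma}_{y}^{-1}\mathbf{y}$, and the cross-covariance is partitioned as $\boldsymbol{\Sigma}_{sy} = [\boldsymbol{\Sigma}_{sy_1}, \ldots, \boldsymbol{\Sigma}_{sy_n}]$ conformably with $\mathbf{y}$.

```latex
% Stationarity of the summed local LMMSE costs
\nabla_{\mathbf{s}}\sum_{i=1}^{n}\tfrac{1}{2}
\bigl\|\mathbf{s}-n\boldsymbol{\Sigma}_{sy_i}\mathbf{u}_i\bigr\|^{2}
= n\,\mathbf{s}-n\sum_{i=1}^{n}\boldsymbol{\Sigma}_{sy_i}\mathbf{u}_i=\mathbf{0}

% Solving for s and stacking the blocks u_i into u = \Sigma_y^{-1} y
\hat{\mathbf{s}}
=\sum_{i=1}^{n}\boldsymbol{\Sigma}_{sy_i}\mathbf{u}_i
=\boldsymbol{\Sigma}_{sy}\,\mathbf{u}
=\boldsymbol{\Sigma}_{sy}\boldsymbol{\Sigma}_{y}^{-1}\mathbf{y}.
```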

Recent advances in cyber-physical systems have also stressed the need for decentralized
nonlinear least-squares (LS) estimation. Monitoring the power grid, for
instance, is challenged by the nonconvexity arising from the nonlinear AC power
flow model; see e.g., [?, Ch. 4], while the interconnection across local transmission
systems motivates their operators to collaboratively monitor the global system
state. Interestingly, this nonlinear (specifically quadratic) estimation task can
be convexified to a semidefinite program (SDP) [?, pg. 168], for which a decentralized
SDP algorithm can be developed by leveraging the batch ADMM; see also [?] for an
ADMM-based centralized SDP precursor.


3.1.1 Decentralized BLUE 


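The minimization involved in (6) can be performed locally at sensor $i$ by employing
numerical optimization techniques [?]. There are cases where the minimization in
(6) yields a closed-form and easy-to-implement updating formula for $\mathbf{s}_i(k+1)$. If, for
example, network nodes wish to find the BLUE estimator in a distributed fashion, the
local cost is $f_i(\mathbf{s};\mathbf{y}_i) = (1/2)\|\boldsymbol{\Sigma}_{y_i}^{-1/2}(\mathbf{y}_i - \mathbf{H}_i\mathbf{s})\|^2$,
and (6) becomes a strictly convex unconstrained quadratic program which admits the
following closed-form solution (see details in [?])

$$\mathbf{s}_i(k+1) = \left(\mathbf{H}_i^\top\boldsymbol{\Sigma}_{y_i}^{-1}\mathbf{H}_i + 2c|\mathcal{N}_i|\mathbf{I}_p\right)^{-1}\left[\mathbf{H}_i^\top\boldsymbol{\Sigma}_{y_i}^{-1}\mathbf{y}_i - \mathbf{v}_i(k) + c\sum_{j \in \mathcal{N}_i}\left(\mathbf{s}_i(k) + \mathbf{s}_j(k)\right)\right]. \tag{7}$$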


The pair (5) and (7) comprise the decentralized (D-) BLUE algorithm [67]. For
the special case where each node acquires unit-variance scalar observations $y_i$, there
is no fitting matrix and $s$ is scalar (i.e., $p = 1$); D-BLUE offers a decentralized
algorithm to obtain the network-wide sample average $\hat{s} = (1/n)\sum_{i=1}^{n} y_i$. The
update rule for the local estimate is obtained by suitably specializing (7) to

$$s_i(k+1) = (1 + 2c|\mathcal{N}_i|)^{-1}\left[y_i - v_i(k) + c\sum_{j \in \mathcal{N}_i}\left(s_i(k) + s_j(k)\right)\right]. \tag{8}$$


Different from existing distributed averaging approaches [4, ?, 83, ?], the ADMM-based
one originally proposed in [67, 70] allows the decentralized computation of
general nonlinear estimators that may not be available in closed form and cannot be
expressed as “averages.” Further, the obtained recursions exhibit robustness in the
presence of additive noise in the inter-node communication links.
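
The D-BLUE recursions (5) and (7) can be sketched in a few lines of NumPy. This is an illustrative simulation rather than the authors' implementation: all nodes are updated synchronously inside one process, and the ring topology, the fitting matrices, the noise levels, the penalty `c`, and the function name `d_blue` are assumptions chosen for the example.

```python
import numpy as np

def d_blue(H, Sigma, y, neighbors, c=1.0, num_iters=1000):
    """Simulate recursions (5) and (7) for all nodes; returns an n x p array of local estimates."""
    n, p = len(H), H[0].shape[1]
    s = np.zeros((n, p))  # local estimates s_i(k), initialized to zero
    v = np.zeros((n, p))  # scaled multiplier sums v_i(k), initialized to zero
    # Quantities each node can precompute from its own data only.
    W = [np.linalg.inv(Sig) for Sig in Sigma]
    A = [H[i].T @ W[i] @ H[i] + 2.0 * c * len(neighbors[i]) * np.eye(p) for i in range(n)]
    b = [H[i].T @ W[i] @ y[i] for i in range(n)]
    for _ in range(num_iters):
        # (5): multiplier update using the estimates exchanged with neighbors
        for i in range(n):
            v[i] += c * sum(s[i] - s[j] for j in neighbors[i])
        # (7): closed-form local estimate update
        s_new = np.empty_like(s)
        for i in range(n):
            rhs = b[i] - v[i] + c * sum(s[i] + s[j] for j in neighbors[i])
            s_new[i] = np.linalg.solve(A[i], rhs)
        s = s_new
    return s

# Hypothetical test problem: n = 5 nodes on a ring, p = 2 unknowns, 3 measurements per node.
n, p = 5, 2
neighbors = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
rng = np.random.default_rng(0)
s_true = np.array([1.0, -2.0])
H = [np.array([[1.0, 0.0], [0.0, 1.0], [1.0, (i + 1) / n]]) for i in range(n)]
Sigma = [(0.5 + 0.1 * i) * np.eye(3) for i in range(n)]
y = [H[i] @ s_true + np.sqrt(0.5 + 0.1 * i) * rng.standard_normal(3) for i in range(n)]

s_dist = d_blue(H, Sigma, y, neighbors, c=1.0, num_iters=1000)

# Centralized BLUE (weighted least squares) computed from the pooled data, for comparison.
G = sum(H[i].T @ np.linalg.inv(Sigma[i]) @ H[i] for i in range(n))
g = sum(H[i].T @ np.linalg.inv(Sigma[i]) @ y[i] for i in range(n))
s_central = np.linalg.solve(G, g)
```

Each row of `s_dist` is one node's estimate and can be compared against `s_central`. Passing `H[i] = np.ones((1, 1))` and `Sigma[i] = np.ones((1, 1))` with scalar observations reduces the update to (8), in which case the nodes track the network-wide sample average.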
3.1.2 Decentralized SDP


Consider now that each scalar $y_i^\ell$ in $\mathbf{y}_i$ adheres to a quadratic measurement model
in $\mathbf{s}$ plus additive Gaussian noise, where the centralized MLE requires solving a
nonlinear least-squares problem. To tackle the nonconvexity due to the quadratic
dependence, the task of estimating the state $\mathbf{s}$ can be reformulated as that of
estimating the outer-product matrix $\mathbf{S} := \mathbf{s}\mathbf{s}^\top$. In this reformulation $y_i^\ell$
is a linear function of $\mathbf{S}$, given by $\mathrm{Tr}(\mathbf{H}_i^\ell\mathbf{S})$ with a known matrix
$\mathbf{H}_i^\ell$ [87]. Motivated by the separable structure in (3), the nonlinear estimation
problem can be similarly formulated as

$$\begin{aligned}
\{\hat{\mathbf{S}}_i\}_{i=1}^{n} \in \arg\min\ & \sum_{i=1}^{n}\sum_{\ell}\left[y_i^\ell - \mathrm{Tr}(\mathbf{H}_i^\ell\mathbf{S}_i)\right]^2 \\
\text{s. to }\ & \mathbf{S}_i = \mathbf{Z}_i^j \text{ and } \mathbf{S}_j = \mathbf{Z}_i^j,\ i = 1,\ldots,n,\ j \in \mathcal{N}_i,\ i \neq j \\
& \mathbf{S}_i \succeq \mathbf{0} \text{ and } \mathrm{rank}(\mathbf{S}_i) = 1,\ i = 1,\ldots,n
\end{aligned} \tag{9}$$

where the positive-semidefiniteness and rank constraints ensure that each matrix
$\mathbf{S}_i$ is an outer-product matrix. By dropping the non-convex rank constraints, the
problem (9) becomes a convex semidefinite program (SDP), which can be solved in
a decentralized fashion by adopting the batch ADMM iterations (5) and (6).
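
The lifting step behind this reformulation can be checked numerically. The sketch below uses a randomly generated state vector and a symmetric measurement matrix, both hypothetical stand-ins rather than power-system quantities, to confirm that a measurement quadratic in the state is linear in the lifted matrix, and that the lifted matrix is positive semidefinite with rank one.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
s = rng.standard_normal(p)        # hypothetical state vector
H = rng.standard_normal((p, p))
H = 0.5 * (H + H.T)               # symmetric measurement matrix (assumed for the example)

S = np.outer(s, s)                # lifted variable S = s s^T
quadratic = s @ H @ s             # noise-free measurement, quadratic in s
linear = np.trace(H @ S)          # the same quantity, linear in S

eigvals = np.linalg.eigvalsh(S)   # eigenvalues of S, used to check positive semidefiniteness
rank = np.linalg.matrix_rank(S)   # an outer product of a nonzero vector has rank one
```

Dropping the rank constraint, as done in the text, keeps the linear measurement model and the positive-semidefiniteness requirement while giving up the guarantee that the solution factors as an outer product.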

This decentralized SDP approach has been successfully employed for monitoring
large-scale power networks [?]. To estimate the complex voltage phasors at all nodes
(a.k.a. the power system state), measurements are collected on real/reactive power and
voltage magnitude, all of which have quadratic dependence on the unknown states.
Gauss-Newton iterations have been the ‘workhorse’ tool for this nonlinear estimation
problem; see e.g., [?]. However, the iterative linearization therein could
suffer from convergence issues and local optimality, especially due to the increasing
variability in power grids with high penetration of renewables. With improved
communication capabilities, decentralized state estimation among multiple control
centers has attracted growing interest; see Fig. 1, illustrating three interconnected
areas aiming to achieve the centralized estimation collaboratively.




Fig. 1 (Left:) Schematic of collaborative power system state estimation among control centers of 
three interconnected networks (IEEE 118-bus test case). (Right:) Local state estimation error vs. 
iteration number using the decentralized SDP-based state estimation method. 
A decentralized SDP-based state estimator has been developed in [87] with reduced
complexity compared to (9). The resultant algorithm involves only internal
voltages and those of next-hop neighbors in the local matrix $\mathbf{S}_{(i)}$; e.g., in Fig. 1
$\mathbf{S}_{(1)}$ is identified by the dashed lines. Interestingly, the positive-semidefiniteness
constraint for the overall $\mathbf{S}$ decouples nicely into that of all local $\{\mathbf{S}_{(i)}\}$, and
the estimation error converges to the centralized performance within only a dozen
iterations. The decentralized SDP framework has successfully addressed a variety of
power system operational challenges, including a distributed microgrid optimal power
flow solver in [18]; see also [32] for a tutorial overview of these applications.


3.2 Decentralized Inference 

Along with decentralized signal parameter estimation, a variety of inference tasks
become possible by relying on the collaborative sensing and computations performed
by networked nodes. In the special context of resource-constrained WSNs
deployed to determine the common messages broadcast by a wireless AP, the relatively
limited node reception capability makes it desirable to design a decentralized
detection scheme for all sensors to attain sufficient statistics for the global problem.
Another exciting application of WSNs is environmental monitoring for, e.g., inferring
the presence or absence of a pollutant over a geographical area. Limited by the
local sensing capability, it is important to develop a decentralized learning framework
such that all sensors can collaboratively approach the performance as if the
network-wide data had been available everywhere (or at an FC for that matter). Given
the diverse inference tasks, the challenge becomes how to design the best inter-node
information exchange schemes that would allow for minimal communication and
computation overhead in specific applications.


3.2.1 Decentralized Detection 

Message decoding. A decentralized detection framework is introduced here for the
message decoding task, which is relevant for diverse wireless communications and
networking scenarios. Consider an AP broadcasting a $p \times 1$ coded block $\mathbf{s}$ to a
network of sensors, all of which know the codebook $\mathcal{C}$ that $\mathbf{s}$ belongs to. For
simplicity assume binary codewords, and that each node $i = 1, \ldots, n$ receives a
same-length block of symbols $\mathbf{y}_i$ through a discrete, memoryless, symmetric channel
that is conditionally independent across sensors. Sensor $i$ knows its local channel from
the AP, as characterized by the conditional pdf $p(y_{il}|s_l)$ per bit $l$. Due to
conceivably low signal-to-noise-ratio (SNR) conditions, each low-cost sensor may be
unable to reliably decode the message. Accordingly, the need arises for information
exchanges among single-hop neighboring sensors to achieve the global (that is,
centralized) error performance. Given $\mathbf{y}_i$ per sensor $i$, the assumption on memoryless
and independent channels yields the centralized maximum-likelihood (ML) decoder as

$$\hat{\mathbf{s}}_{\mathrm{ML}} = \arg\max_{\mathbf{s} \in \mathcal{C}}\ p(\{\mathbf{y}_i\}_{i=1}^{n}|\mathbf{s}) = \arg\min_{\mathbf{s} \in \mathcal{C}}\ \sum_{l=1}^{p}\sum_{i=1}^{n}\left[-\log p(y_{il}|s_l)\right]. \tag{10}$$

ML decoding amounts to deciding the most likely codeword among multiple candidate
ones and, in this sense, it can be viewed as a test of multiple hypotheses.
In this general context, belief propagation approaches have been developed in [66],
so that all nodes can cooperate to learn the centralized likelihood per hypothesis.
However, even for linear binary block codes, the number of hypotheses, namely the
cardinality of $\mathcal{C}$, grows exponentially with the codeword length. This introduces
high communication and computation burden for the low-cost sensor designs.

The key here is to extract minimal sufficient statistics for the centralized decoding
problem. For binary codes, the log-likelihood terms in (10) become
$\log p(y_{il}|s_l) = -\gamma_{il}s_l + \log p(y_{il}|s_l = 0)$, where

$$\gamma_{il} := \log\left(\frac{p(y_{il}|s_l = 0)}{p(y_{il}|s_l = 1)}\right) \tag{11}$$

is the local log-likelihood ratio (LLR) for the bit $s_l$ at sensor $i$. Ignoring all
constant terms $\log p(y_{il}|s_l = 0)$, the ML decoding objective ends up only depending on
the sum LLRs, as given by
$\hat{\mathbf{s}}_{\mathrm{ML}} = \arg\min_{\mathbf{s} \in \mathcal{C}} \sum_{l=1}^{p}\left(\sum_{i=1}^{n}\gamma_{il}\right)s_l$.
Clearly, the sufficient statistic for solving (10) is the sum of all local LLR terms, or
equivalently, the average $\bar{\gamma}_l = (1/n)\sum_{i=1}^{n}\gamma_{il}$ for each bit $l$.
Interestingly, the average of $\{\gamma_{il}\}_{i=1}^{n}$ is one instance of the BLUE discussed
in Section 3.1.1 when $\boldsymbol{\Sigma}_{y_i} = \mathbf{H}_i = \mathbf{I}_{p \times p}$, since

$$\bar{\gamma}_l = \arg\min_{\gamma}\ \sum_{i=1}^{n}(\gamma_{il} - \gamma)^2. \tag{12}$$

This way, the ADMM-based decentralized learning framework in Section 2 allows
for all sensors to collaboratively attain the sufficient statistic for the decoding
problem (10) via in-network processing. Each sensor only needs to estimate a vector
of the codeword length $p$, which bypasses the exponential complexity under the
framework of belief propagation. As shown in [89], decentralized soft decoding is
also feasible since the a posteriori probability (APP) evaluator also relies on LLR
averages, which are sufficient statistics; extensions to non-binary alphabet
codeword constraints and randomly failing inter-sensor links are also considered there.
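
The role of the LLR averages as sufficient statistics can be illustrated with a toy example. The sketch below assumes a binary symmetric channel per sensor, which is one instance of the discrete memoryless symmetric channel in the text. The codebook, the crossover probabilities, and the received blocks are all hypothetical. It compares the ML decision obtained directly from the likelihoods in (10) with the one obtained from the per-bit average LLRs of (11).

```python
import math

def llr_bsc(y_bit, eps):
    """LLR (11) of one received bit over a binary symmetric channel with crossover eps."""
    p0 = 1 - eps if y_bit == 0 else eps   # p(y | s = 0)
    p1 = eps if y_bit == 0 else 1 - eps   # p(y | s = 1)
    return math.log(p0 / p1)

def neg_log_lik(codeword, received, eps_list):
    """Centralized ML metric in (10): sum over sensors and bits of -log p(y_il | s_l)."""
    total = 0.0
    for y, eps in zip(received, eps_list):
        for y_bit, s_bit in zip(y, codeword):
            total -= math.log(1 - eps if y_bit == s_bit else eps)
    return total

codebook = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]  # toy even-parity code
eps_list = [0.1, 0.2, 0.3]                                # per-sensor crossover probabilities
received = [(0, 1, 1), (1, 1, 1), (0, 1, 0)]              # received blocks y_i, one per sensor

n, p = len(received), len(codebook[0])
# Per-bit average LLR: the quantity each sensor would obtain via in-network averaging.
avg_llr = [sum(llr_bsc(received[i][l], eps_list[i]) for i in range(n)) / n
           for l in range(p)]

# Decision from the full likelihoods versus decision from the averaged LLRs only.
ml_direct = min(codebook, key=lambda cw: neg_log_lik(cw, received, eps_list))
ml_from_llr = min(codebook, key=lambda cw: sum(g * s for g, s in zip(avg_llr, cw)))
```

Here the averages are computed centrally for brevity; in the decentralized scheme each sensor would instead run the averaging recursions on its own LLR vector and then apply the same final minimization locally.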

The bit error rate (BER) versus SNR plot in Fig. 2 demonstrates the performance
of ADMM-based in-network decoding of a convolutional code with $p = 60$ and
$|\mathcal{C}| = 40$. This numerical test involves $n = 10$ sensors and AWGN AP-sensor
channels with noise variances $\{\sigma_i^2\}$. Four schemes are compared: (i) the local
ML decoder based on per-sensor data only (corresponds to the curve marked as $k = 0$
since it is used to initialize the decentralized iterations); (ii) the centralized
benchmark ML decoder (corresponds to $k = \infty$); (iii) the in-network decoder which
forms $\bar{\gamma}_l$ using “consensus-averaging” linear iterations [?]; and (iv) the
ADMM-based decentralized algorithm. Indeed, the ADMM-based decoder exhibits faster
convergence than its consensus-averaging counterpart; and surprisingly, only 10
iterations suffice to bring the decentralized BER very close to the centralized
performance.
Fig. 2 BER vs. SNR (in dB) curves depicting the local ML decoder vs. the consensus-averaging
decoder vs. the ADMM-based approach vs. the centralized ML decoder benchmark.


Message demodulation. In a related detection scenario the common AP message $\mathbf{s}$
can be mapped to a space-time matrix, with each entry drawn from a finite alphabet
$\mathcal{A}$. The received block $\mathbf{y}_i$ per sensor $i$ typically admits a linear input/output
relationship $\mathbf{y}_i = \mathbf{H}_i\mathbf{s} + \boldsymbol{\epsilon}_i$. Matrix $\mathbf{H}_i$ is formed from the
fading AP-sensor channel, and $\boldsymbol{\epsilon}_i$ stands for the additive white Gaussian noise
of unit variance, which is assumed uncorrelated across sensors. Since low-cost sensors
have a very limited budget on the number of antennas compared to the AP, the length of
$\mathbf{y}_i$ is much shorter than that of $\mathbf{s}$ (i.e., $m_i < p$). Hence, the local linear
demodulator using $\{\mathbf{y}_i, \mathbf{H}_i\}$ may not even be able to identify $\mathbf{s}$. Again, it
is critical for each sensor $i$ to cooperate with its neighbors to collectively
form the global ML demodulator

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{A}^{p}}\ \sum_{i=1}^{n}\|\mathbf{y}_i - \mathbf{H}_i\mathbf{s}\|^2 = \arg\max_{\mathbf{s} \in \mathcal{A}^{p}}\ \left\{2\left(\sum_{i=1}^{n}\mathbf{r}_i\right)^{\!\top}\mathbf{s} - \mathbf{s}^\top\left(\sum_{i=1}^{n}\mathbf{R}_i\right)\mathbf{s}\right\} \tag{13}$$

where $\mathbf{r}_i := \mathbf{H}_i^\top\mathbf{y}_i$ and $\mathbf{R}_i := \mathbf{H}_i^\top\mathbf{H}_i$ are the sample
(cross-)covariance terms. To solve (13) locally, it suffices for each sensor to acquire
the network-wide average of $\{\mathbf{r}_i\}_{i=1}^{n}$ as well as that of
$\{\mathbf{R}_i\}_{i=1}^{n}$, as both averages constitute the minimal sufficient statistics
for the centralized demodulator. Arguments similar to decentralized
decoding lead to ADMM iterations that (as with BLUE) attain locally these average
terms. These iterations constitute a viable decentralized demodulation method,
whose performance analysis in [88] reveals that its error diversity order can approach
the centralized one within only a dozen iterations.
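
The equivalence of the two forms in (13) can be checked by brute force on a small instance. The sketch below is illustrative only: the binary alphabet, the dimensions, the random channel matrices, and the reduced noise level are assumptions, and the averages of the local statistics are computed centrally rather than by in-network iterations.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 4, 3, 2                     # sensors, message length, per-sensor block length (m < p)
alphabet = (-1.0, 1.0)                # hypothetical binary alphabet
s_true = np.array([1.0, -1.0, 1.0])
H = [rng.standard_normal((m, p)) for _ in range(n)]
y = [H[i] @ s_true + 0.1 * rng.standard_normal(m) for i in range(n)]

# Network-wide averages of the local statistics r_i = H_i^T y_i and R_i = H_i^T H_i.
r_avg = sum(H[i].T @ y[i] for i in range(n)) / n
R_avg = sum(H[i].T @ H[i] for i in range(n)) / n

# Enumerate all candidate messages (feasible only for tiny p).
candidates = [np.array(c) for c in itertools.product(alphabet, repeat=p)]

# Left-hand form of (13): least-squares fit to all raw data.
ml_direct = min(candidates,
                key=lambda s: sum(np.sum((y[i] - H[i] @ s) ** 2) for i in range(n)))
# Right-hand form of (13): uses only the averaged statistics.
ml_stats = max(candidates, key=lambda s: 2 * r_avg @ s - s @ R_avg @ s)
```

Scaling both sums by $1/n$ does not change the maximizer, which is why averages (rather than sums) suffice. A sensor holding `r_avg` and `R_avg` never needs the raw blocks of the other sensors.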

As demonstrated by the decoding and demodulation tasks, the cornerstone of
developing a decentralized detection scheme is to extract the minimal sufficient
statistics for the centralized hypothesis testing problem. This leads to significant
complexity reduction in terms of communications and computational overhead.
3.2.2 Decentralized Support Vector Machines


The merits of support vector machines (SVMs) in a centralized setting have been
well documented in various supervised classification tasks including surveillance,
monitoring, and segmentation; see e.g., [?]. These applications often call for
decentralized supervised learning solutions, when limited training data are acquired at
different locations and a central processing unit is costly or even discouraged due to,
e.g., scalability, communication overhead, or privacy reasons. Noteworthy examples
include WSNs for environmental or structural health monitoring, as well as diagnosis
of medical conditions from patients' records distributed at different hospitals.

In this in-network classification task, a labeled training set
$\mathcal{X}_i := \{(\mathbf{x}_{il}, y_{il})\}_{l=1}^{m_i}$ of size $m_i$ is available per node $i$, where
$\mathbf{x}_{il} \in \mathbb{R}^{p}$ is the input data vector and $y_{il} \in \{-1, 1\}$ denotes its
corresponding class label. Given all network-wide training data
$\{\mathcal{X}_i\}_{i=1}^{n}$, the centralized SVM seeks a maximum-margin linear discriminant
function $\hat{g}(\mathbf{x}) = \mathbf{x}^\top\hat{\mathbf{s}} + \hat{b}$, by solving the following convex
optimization problem [?]

$$\begin{aligned}
\{\hat{\mathbf{s}}, \hat{b}\} = \arg\min_{\mathbf{s},\, b,\, \{\xi_{il}\}}\ & \frac{1}{2}\|\mathbf{s}\|^2 + C\sum_{i=1}^{n}\sum_{l=1}^{m_i}\xi_{il} \\
\text{s. to }\ & y_{il}(\mathbf{s}^\top\mathbf{x}_{il} + b) \geq 1 - \xi_{il}, \quad i = 1,\ldots,n,\ l = 1,\ldots,m_i \\
& \xi_{il} \geq 0, \quad i = 1,\ldots,n,\ l = 1,\ldots,m_i
\end{aligned} \tag{14}$$

where the slack variables $\xi_{il}$ account for non-linearly separable training sets, and $C$
is a tunable positive scalar that allows for controlling model complexity. Nonlinear
discriminant functions $g(\mathbf{x})$ can also be accommodated after mapping input vectors
$\mathbf{x}_{il}$ to a higher- (possibly infinite-) dimensional space using e.g., kernel functions,
and pursuing a generalized maximum-margin linear classifier as in (14). Since the
SVM classifier (14) couples the local datasets, early distributed designs either rely
on a centralized processor, so they are not decentralized [47], or their performance
is not guaranteed to reach that of the centralized SVM [56].

A fresh view of decentralized SVM classification is taken in [29], which reformulates
(14) to estimate the parameter pair $\{\mathbf{s}, b\}$ from all local data $\{\mathcal{X}_i\}$ after
eliminating slack variables $\xi_{il}$, namely

$$\{\hat{\mathbf{s}}, \hat{b}\} = \arg\min_{\mathbf{s},\, b}\ \frac{1}{2}\|\mathbf{s}\|^2 + C\sum_{i=1}^{n}\sum_{l=1}^{m_i}\max\{0,\ 1 - y_{il}(\mathbf{s}^\top\mathbf{x}_{il} + b)\}. \tag{15}$$

Notice that (15) has the same decomposable structure as the general decentralized
learning task in (1), upon identifying the local cost
$f_i(\bar{\mathbf{s}};\mathbf{y}_i) = \frac{1}{2n}\|\mathbf{s}\|^2 + C\sum_{l=1}^{m_i}\max\{0,\ 1 - y_{il}(\mathbf{s}^\top\mathbf{x}_{il} + b)\}$,
where $\bar{\mathbf{s}} := [\mathbf{s}^\top, b]^\top$ and $\mathbf{y}_i := [y_{i1}, \ldots, y_{im_i}]^\top$.
Accordingly, all network nodes can solve (15) in a decentralized fashion via iterations
obtained following the ADMM-based algorithmic framework of Section 2.
Such a decentralized ADMM-DSVM scheme is provably convergent to the centralized
SVM classifier (14), and can also incorporate nonlinear discriminant functions
as detailed in [29].
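
The decomposition of the SVM objective into per-node costs can be verified on a toy data set. The sketch below evaluates the local cost with the $1/(2n)$ regularization split and checks that the local costs add up to the pooled objective in (15). The data, the candidate classifier `(s, b)`, and the function names are hypothetical; the code only evaluates costs and does not train a classifier.

```python
def hinge_sum(s, b, data):
    """Sum of hinge losses max{0, 1 - y (s^T x + b)} over a list of (x, y) examples."""
    return sum(max(0.0, 1.0 - label * (sum(w * x for w, x in zip(s, x_vec)) + b))
               for x_vec, label in data)

def local_cost(s, b, data, n, C):
    """Local cost f_i: (1/(2n)) ||s||^2 plus C times the hinge losses of node i's examples."""
    return sum(w * w for w in s) / (2.0 * n) + C * hinge_sum(s, b, data)

def centralized_cost(s, b, all_data, C):
    """Objective of (15) evaluated on the pooled training set."""
    return 0.5 * sum(w * w for w in s) + C * hinge_sum(s, b, all_data)

# Hypothetical toy data: three nodes, each holding two labeled 2-D examples.
node_data = [
    [((-1.0, -1.2), -1), ((0.8, 1.1), 1)],
    [((-0.5, -2.0), -1), ((1.5, 0.3), 1)],
    [((0.2, -0.1), -1), ((2.0, 1.0), 1)],
]
s, b, C = (0.7, 0.4), -0.1, 10.0      # arbitrary candidate classifier and tuning constant
n = len(node_data)

total_local = sum(local_cost(s, b, d, n, C) for d in node_data)
pooled = [example for d in node_data for example in d]
total_central = centralized_cost(s, b, pooled, C)
```

Because the identity holds for every candidate `(s, b)`, minimizing the sum of local costs under consensus constraints targets the same classifier as the centralized problem.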
Fig. 3 Decision boundary comparison among ADMM-DSVM, centralized SVM and local SVM
results for synthetic data generated from two Gaussian classes, and a network of n = 30 nodes.


To illustrate the performance of the ADMM-DSVM algorithm in [29], consider a
randomly generated network with $n = 30$ nodes. Each node acquires labeled training
examples from two different classes, which are equiprobable and consist of random
vectors drawn from a two-dimensional (i.e., $p = 2$) Gaussian distribution with common
covariance matrix $\boldsymbol{\Sigma}_x = [1, 0; 0, 2]$, and mean vectors
$\boldsymbol{\mu}_1 = [-1, -1]^\top$ and $\boldsymbol{\mu}_2 = [1, 1]^\top$, respectively. The Bayes optimal
classifier for this 2-class problem is linear [25, Ch. 2]. To visualize this test case,
Fig. 3 depicts the global training set, along with the linear discriminant functions
found by the centralized SVM (14) and the ADMM-DSVM at two different nodes after 400
iterations. Local SVM results for two different nodes are also included for comparison.
It is apparent that ADMM-DSVM approaches the decision rule of its centralized
counterpart, whereas local classifiers deviate since they neglect most of the training
examples in the network.


3.2.3 Decentralized Clustering 

Unsupervised learning using a network of wireless sensors as an exploratory
infrastructure is well motivated for inferring hidden structures in distributed data
collected by the sensors. Different from supervised SVM-based classification tasks,
each node $i$ has available a set of unlabeled observations
$\mathcal{X}_i := \{\mathbf{x}_{il},\ l = 1,\ldots,m_i\}$, drawn from a total of $K$
classes. In this network setting, the goal is to design local clustering rules
assigning each $\mathbf{x}_{il}$ to a cluster $k \in \{1,\ldots,K\}$. Again, the
desideratum is a decentralized algorithm capable of attaining the performance of a
benchmark clustering scheme, where all $\{\mathcal{X}_i\}_{i=1}^{n}$ are centrally
available for joint processing.

Fig. 4 Average performance of hard-DKM on a real data set using a WSN with n = 20 nodes for
various values of $\eta$ and $K$ (left). Clustering with $K = 3$ and $\eta = 5$ (right) at k = 400 iterations.

Various criteria are available to quantify similarity among observations in a
centralized setting, and a popular selection is the deterministic partitional
clustering (DPC) one entailing prototypical elements (a.k.a. cluster centroids) per
class in order to avoid comparisons between every pair of observations. Let
$\boldsymbol{\mu}_k$ denote the prototype element for class $k$, and $\nu_{ilk}$ the
membership coefficient of $\mathbf{x}_{il}$ to class $k$. A natural clustering problem
amounts to specifying the family of $K$ clusters with centroids such that the sum of
squared-errors is minimized; that is

$$\min_{\{\nu_{ilk}\}\in\mathcal{V},\ \{\boldsymbol{\mu}_k\}}\ \sum_{i=1}^{n}\sum_{l=1}^{m_i}\sum_{k=1}^{K}\nu_{ilk}^{\rho}\,\|\mathbf{x}_{il}-\boldsymbol{\mu}_k\|^2 \tag{16}$$


where $\rho \geq 1$ is a tuning parameter, and
$\mathcal{V} := \{\nu_{ilk} : \sum_{k}\nu_{ilk} = 1,\ \nu_{ilk}\in[0,1],\ \forall\, i,l\}$
denotes the convex set of constraints on all membership coefficients. With $\rho = 1$
and $\{\boldsymbol{\mu}_k\}$ fixed, (16) becomes a linear program in $\nu_{ilk}$.
Consequently, (16) admits binary $\{0,1\}$ optimal solutions giving rise to the
so-termed hard assignments, by choosing the cluster $k$ for $\mathbf{x}_{il}$
whenever $\nu_{ilk} = 1$. Otherwise, for $\rho > 1$ the optimal coefficients generally
result in soft membership assignments, and the optimal cluster is
$k^* := \arg\max_k \nu_{ilk}$ for $\mathbf{x}_{il}$. In either case, the DPC clustering
problem (16) is NP-hard, which motivates the (suboptimal) K-means algorithm that, on
a per-iteration basis, proceeds in two steps to minimize the cost in (16) w.r.t.:
(S1) $\{\nu_{ilk}\}\in\mathcal{V}$ with $\{\boldsymbol{\mu}_k\}$ fixed; and (S2)
$\{\boldsymbol{\mu}_k\}$ with $\{\nu_{ilk}\}$ fixed [44]. Convergence of this two-step
alternating-minimization scheme is guaranteed at least to a local minimum.
Nonetheless, K-means requires central availability of global information (those
variables that are fixed per step), which challenges in-network implementations. For
this reason, most early attempts are either confined to specific communication
network topologies, or they offer no closed-form local solutions; see e.g., [58, 81].
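For concreteness, the two alternating steps can be sketched for the hard-assignment case with tuning parameter equal to one. This is a centralized toy version, not the decentralized scheme discussed next; the data points, initial centroids and function names are illustrative assumptions.

```python
def assign(points, centroids):
    """(S1) With centroids fixed, give each point to its closest centroid."""
    labels = []
    for x in points:
        d = [sum((a - m) ** 2 for a, m in zip(x, mu)) for mu in centroids]
        labels.append(d.index(min(d)))
    return labels

def update(points, labels, centroids):
    """(S2) With assignments fixed, each centroid becomes its cluster mean."""
    new = []
    for k, mu in enumerate(centroids):
        members = [x for x, lab in zip(points, labels) if lab == k]
        if not members:              # an empty cluster keeps its old centroid
            new.append(mu)
            continue
        new.append(tuple(sum(x[d] for x in members) / len(members)
                         for d in range(len(mu))))
    return new

def sse(points, labels, centroids):
    """Sum of squared errors, i.e., the clustering cost for hard assignments."""
    return sum(sum((a - m) ** 2 for a, m in zip(x, centroids[lab]))
               for x, lab in zip(points, labels))

def kmeans(points, centroids, iters=20):
    costs = []
    for _ in range(iters):
        labels = assign(points, centroids)
        centroids = update(points, labels, centroids)
        costs.append(sse(points, labels, centroids))
    return labels, centroids, costs

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, costs = kmeans(points, [(0.0, 0.0), (1.0, 0.0)])
```

Neither step can increase the cost, which is why the recorded values in `costs` never go up; the scheme may still stop at a local minimum that depends on the initial centroids.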

To address these limitations, [30] casts (16) [yet another instance of (1)] as a
decentralized estimation problem. It is thus possible to leverage ADMM iterations
and solve (16) in a decentralized fashion through information exchanges among
single-hop neighbors only. Despite the non-convexity of (16), the decentralized DPC
iterations in [30] provably approach a local minimum arbitrarily closely, where the
asymptotic convergence holds for hard K-means with $\rho = 1$. Further extensions
in [30] include a decentralized expectation-maximization algorithm for probabilistic
partitional clustering, and methods to handle an unknown number of classes.


Clustering of oceanographic data. Environmental monitoring is a typical application
of WSNs. In WSNs deployed for oceanographic monitoring, the cost of computation per
node is lower than the cost of accessing each node's observations. This makes the
option of centralized processing less attractive, thus motivating decentralized
processing. Here we test the decentralized DPC schemes of [30] on real data
collected by multiple underwater sensors off the Mediterranean coast of Spain, with
the goal of identifying regions sharing common physical characteristics. A total of
5,720 feature vectors were selected, each having as entries the temperature (°C)
and salinity (psu) levels (p = 2). The measurements were normalized to have zero
mean and unit variance, and they were grouped in n = 20 blocks (one per sensor) of
$m_i = 286$ measurements each. The algebraic connectivity of the WSN is 0.2289
and the average degree per node is 4.9. Fig. 4 (left) shows the performance of 25
Monte Carlo runs for the hard-DKM algorithm with different values of the parameter
$c := \eta$. The best average convergence rate was obtained for $\eta = 5$, attaining
the average centralized performance after 300 iterations. Tests with different values
of $K$ and $\eta$ are also included in Fig. 4 (left) for comparison. Note that for
$K = 2$ and $\eta = 5$ hard-DKM hovers around a point without converging. Choosing a
larger $\eta$ guarantees convergence of the algorithm to a unique solution. The
clustering results of hard-DKM at k = 400 iterations for $\eta = 5$ and $K = 3$ are
depicted in Fig. 4 (right).


4 Decentralized Adaptive Estimation 

Sections 2 and 3 dealt with decentralized batch estimation, whereby network nodes
acquire data only once and then locally exchange messages to reach consensus on
the desired estimators. In many applications however, networks are deployed to
perform estimation in a constantly changing environment without having available a
complete statistical description of the underlying processes of interest, e.g., with
time-varying thermal or seismic sources. This motivates the development of
decentralized adaptive estimation schemes, where nodes collect data sequentially in
time and local estimates are recursively refined “on-the-fly.” In settings where
statistical state models are available, it is prudent to develop model-based tracking
approaches implementing in-network Kalman or particle filters. Next, Section 2's
scope is broadened to facilitate real-time (adaptive) processing of network data,
when the local costs in (1) and unknown parameters are allowed to vary with time.


4.1 Decentralized Least-Mean Squares 


A decentralized least-mean squares (LMS) algorithm is developed here for adaptive
estimation of (possibly) nonstationary parameters, even when statistical information
such as ensemble data covariances is unknown. Suppose network nodes are deployed to
estimate a signal vector $\mathbf{s}(t) \in \mathbb{R}^{p\times 1}$ in a collaborative
fashion subject to single-hop communication constraints, by resorting to the linear
LMS criterion, see e.g., [46, 69, 74]. Per time instant $t = 0, 1, 2, \ldots$, each
node has available a regression vector $\mathbf{h}_i(t) \in \mathbb{R}^{p\times 1}$
and acquires a scalar observation $y_i(t)$, both assumed zero-mean without loss of
generality. Introducing the global vector
$\mathbf{y}(t) := [y_1(t) \ldots y_n(t)]^T \in \mathbb{R}^{n\times 1}$ and matrix
$\mathbf{H}(t) := [\mathbf{h}_1(t) \ldots \mathbf{h}_n(t)]^T \in \mathbb{R}^{n\times p}$,
the global time-dependent LMS estimator of interest can be written as
[46], [69], [74, p. 14]

$$\hat{\mathbf{s}}(t) := \arg\min_{\mathbf{s}} E\left[\|\mathbf{y}(t)-\mathbf{H}(t)\mathbf{s}\|^2\right] = \arg\min_{\mathbf{s}}\sum_{i=1}^{n} E\left[(y_i(t)-\mathbf{h}_i^T(t)\mathbf{s})^2\right]. \tag{17}$$

For jointly wide-sense stationary $\{\mathbf{y}(t), \mathbf{H}(t)\}$, solving (17) leads
to the well-known Wiener filter estimate
$\hat{\mathbf{s}}_W = \boldsymbol{\Sigma}_H^{-1}\boldsymbol{\Sigma}_{Hy}$, where
$\boldsymbol{\Sigma}_H := E[\mathbf{H}^T(t)\mathbf{H}(t)]$ and
$\boldsymbol{\Sigma}_{Hy} := E[\mathbf{H}^T(t)\mathbf{y}(t)]$; see e.g., [74, p. 15].

For the cases where the auto- and cross-covariance matrices $\boldsymbol{\Sigma}_H$ and
$\boldsymbol{\Sigma}_{Hy}$ are unknown, the approach followed here to develop the
decentralized (D-) LMS algorithm includes two main building blocks: (i) recast (17)
into an equivalent form amenable to in-network processing via the ADMM framework of
Section 2; and (ii) leverage stochastic approximation iterations [40] to obtain an
adaptive LMS-like algorithm that can handle the unavailability/variation of
statistical information. Following the algorithmic construction steps outlined in
Section 2, the following updating recursions are obtained for the multipliers
$\mathbf{v}_i(t)$ and the local estimates $\mathbf{s}_i(t+1)$ at time instant $t+1$ and
$i = 1,\ldots,n$

$$\mathbf{v}_i(t) = \mathbf{v}_i(t-1) + c\sum_{j\in\mathcal{N}_i}[\mathbf{s}_i(t)-\mathbf{s}_j(t)] \tag{18}$$

$$\mathbf{s}_i(t+1) = \arg\min_{\mathbf{s}_i}\left\{E\left[(y_i(t+1)-\mathbf{h}_i^T(t+1)\mathbf{s}_i)^2\right] + \mathbf{v}_i^T(t)\mathbf{s}_i + c\sum_{j\in\mathcal{N}_i}\left\|\mathbf{s}_i - \frac{\mathbf{s}_i(t)+\mathbf{s}_j(t)}{2}\right\|^2\right\}. \tag{19}$$


Fig. 5 Tracking with D-LMS. (left) Local MSE performance metrics both with and without
inter-node communication noise for sensors 3 and 12; and (right) True and estimated
time-varying parameters for a representative node, using slow and optimal adaptation levels.

It is apparent that after differentiating (19) and setting the gradient equal to zero,
$\mathbf{s}_i(t+1)$ can be obtained as the root of an equation of the form

$$E[\boldsymbol{\varphi}(\mathbf{s}_i, y_i(t+1), \mathbf{h}_i(t+1))] = \mathbf{0} \tag{20}$$

where $\boldsymbol{\varphi}$ corresponds to the stochastic gradient of the cost in (19).
However, the previous equation cannot be solved since the nodes do not have available
any statistical information about the acquired data. Inspired by stochastic
approximation techniques (such as the celebrated Robbins-Monro algorithm; see e.g.,
[40, Ch. 1]), which iteratively find the root of (20) given noisy observations
$\{\boldsymbol{\varphi}(\mathbf{s}_i(t), y_i(t+1), \mathbf{h}_i(t+1))\}_{t=0}^{\infty}$,
one can drop the unknown expected value to obtain the following D-LMS (i.e.,
stochastic gradient) updates

$$\mathbf{s}_i(t+1) = \mathbf{s}_i(t) + \mu\left[\mathbf{h}_i(t+1)e_i(t+1) - \mathbf{v}_i(t) - c\sum_{j\in\mathcal{N}_i}[\mathbf{s}_i(t)-\mathbf{s}_j(t)]\right] \tag{21}$$

where $\mu$ denotes a constant step-size, and
$e_i(t+1) := 2[y_i(t+1) - \mathbf{h}_i^T(t+1)\mathbf{s}_i(t)]$
is twice the local a priori error.
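The step from the local minimization to the stochastic-gradient update can be spelled out as follows. This worked derivation assumes the local cost is the expected squared error plus the linear multiplier term plus a quadratic penalty $c\sum_{j\in\mathcal{N}_i}\|\mathbf{s}_i - (\mathbf{s}_i(t)+\mathbf{s}_j(t))/2\|^2$; the shorthand $J_i$ for that cost is introduced here only for illustration.

```latex
\begin{align*}
\nabla_{\mathbf{s}_i} J_i(\mathbf{s}_i)
  &= -2\,E\!\left[\mathbf{h}_i(t+1)\bigl(y_i(t+1)-\mathbf{h}_i^T(t+1)\mathbf{s}_i\bigr)\right]
     + \mathbf{v}_i(t)
     + 2c\sum_{j\in\mathcal{N}_i}\Bigl(\mathbf{s}_i-\tfrac{1}{2}\bigl[\mathbf{s}_i(t)+\mathbf{s}_j(t)\bigr]\Bigr) \\
\intertext{Evaluating at $\mathbf{s}_i=\mathbf{s}_i(t)$ and replacing the expectation by its instantaneous value gives}
\hat{\nabla} J_i\bigl(\mathbf{s}_i(t)\bigr)
  &= -\mathbf{h}_i(t+1)\,e_i(t+1) + \mathbf{v}_i(t)
     + c\sum_{j\in\mathcal{N}_i}\bigl[\mathbf{s}_i(t)-\mathbf{s}_j(t)\bigr], \\
\mathbf{s}_i(t+1)
  &= \mathbf{s}_i(t) - \mu\,\hat{\nabla} J_i\bigl(\mathbf{s}_i(t)\bigr).
\end{align*}
```

The factor of two in the a priori error $e_i(t+1)$ absorbs the factor of two from differentiating the squared error, and the penalty term contributes exactly the neighborhood disagreement scaled by $c$.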

Recursions (18) and (21) constitute the D-LMS algorithm, which can be viewed
as a stochastic-gradient counterpart of D-BLUE in Section 3.1.1. D-LMS is a
pioneering approach for decentralized online learning, which blends for the first
time affordable (first-order) stochastic approximation steps with parallel ADMM
iterations. The use of a constant step-size $\mu$ endows D-LMS with tracking
capabilities. This is desirable in a constantly changing environment, within which
e.g., WSNs are envisioned to operate. The D-LMS algorithm is stable and converges
even in the presence of inter-node communication noise (see details in [55, 69]).
Further, closed-form expressions for the evolution and the steady-state mean-square
error (MSE), as well as selection guidelines for the step-size $\mu$ can be found in [55].
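A minimal simulation sketch of these recursions is given below. It assumes the multiplier update adds $c$ times the disagreement with single-hop neighbors, and that the estimate update takes a step of size $\mu$ along the instantaneous gradient (twice the a priori error times the regressor, minus the multiplier, minus $c$ times the disagreement). The ring topology, step-size, penalty coefficient, noise level and the function name `dlms` are illustrative choices, not the chapter's experimental setup.

```python
import random

def dlms(neighbors, s0, mu=0.05, c=0.5, noise_std=0.01, steps=2000, seed=1):
    """Run the assumed D-LMS recursions on a fixed graph; return final s_i."""
    rng = random.Random(seed)
    n, p = len(neighbors), len(s0)
    s = [[0.0] * p for _ in range(n)]    # local estimates s_i(t)
    v = [[0.0] * p for _ in range(n)]    # local multipliers v_i(t)
    for _ in range(steps):
        # Communication step: each node collects s_j(t) from its neighbors
        # and forms the sum of disagreements s_i(t) - s_j(t).
        diff = [[sum(s[i][k] - s[j][k] for j in neighbors[i]) for k in range(p)]
                for i in range(n)]
        new_s = []
        for i in range(n):
            # Multiplier update.
            v[i] = [v[i][k] + c * diff[i][k] for k in range(p)]
            # New local datum y_i(t+1) = h_i(t+1)^T s0 + noise (static s0 here).
            h = [rng.gauss(0.0, 1.0) for _ in range(p)]
            y = sum(hk * sk for hk, sk in zip(h, s0)) + rng.gauss(0.0, noise_std)
            # Twice the local a priori error.
            e = 2.0 * (y - sum(hk * sk for hk, sk in zip(h, s[i])))
            # Stochastic-gradient update of the local estimate.
            new_s.append([s[i][k] + mu * (h[k] * e - v[i][k] - c * diff[i][k])
                          for k in range(p)])
        s = new_s
    return s

ring = [[1, 3], [0, 2], [1, 3], [0, 2]]   # 4 nodes on a ring (assumed topology)
s0 = [1.0, -2.0]
estimates = dlms(ring, s0)
max_err = max(abs(est[k] - s0[k]) for est in estimates for k in range(len(s0)))
```

Each node only touches its own data and the estimates of its two ring neighbors, yet all local estimates settle close to the common parameter. With a time-varying parameter the same constant step-size would trade tracking agility against steady-state error.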

Here we test the tracking performance of D-LMS with a computer simulation.
For a random geometric graph with n = 20 nodes, network-wide observations $y_i$ are
linearly related to a large-amplitude slowly time-varying parameter vector
$\mathbf{s}_0(t) \in \mathbb{R}^{p}$. Specifically,
$\mathbf{s}_0(t) = \boldsymbol{\Theta}\mathbf{s}_0(t-1) + \boldsymbol{\zeta}(t)$, where
$\boldsymbol{\Theta}$ = (1 - lO^"^)diag($\theta_1,\ldots,\theta_p$) with
$\theta_i \sim \mathcal{U}[0,1]$. The driving noise $\boldsymbol{\zeta}(t)$ is normally
distributed with covariance = lO^^Ip. To model noisy links, additive white Gaussian
noise with variance 10^^ is present at the receiving end. For
$\mu = 5\times 10^{-2}$, Fig. 5 (left) depicts the local performance of two
representative nodes through the evolution of the excess mean-square error
$\mathrm{EMSE}_i(t) = E[(\mathbf{h}_i^T(t)[\mathbf{s}_i(t-1) - \mathbf{s}_0(t-1)])^2]$
and the mean-square deviation
$\mathrm{MSD}_i(t) = E[\|\mathbf{s}_i(t) - \mathbf{s}_0(t)\|^2]$ figures of merit. Both
noisy and ideal links are considered, and the empirical curves closely follow the
theoretical trajectories derived in [55]. Steady-state limiting values are also
extremely accurate. As intuitively expected and suggested by the analysis, a
performance penalty due to non-ideal links is also apparent. Fig. 5 (right)
illustrates how the adaptation level affects the resulting per-node estimates when
tracking time-varying parameters with D-LMS. For $\mu$ = 5 X 10 (slow adaptation) and
$\mu$ = 5 x 10 ^ (near optimal adaptation), we depict the third entry of the parameter
vector $[\mathbf{s}_0(t)]_3$ and the respective estimates from the randomly chosen
sixth node. Under optimal adaptation the local estimate closely tracks the true
variations, while, as expected, for the smaller step-size D-LMS fails to provide an
accurate estimate [55, 74].


4.2 Decentralized Recursive Least-Squares 


The recursive least-squares (RLS) algorithm has well-appreciated merits for reducing
complexity and storage requirements in online estimation of stationary signals,
as well as for tracking slowly-varying nonstationary processes [38, 74]. RLS is
especially attractive when the state and/or data model are not available (as with
LMS), and fast convergence rates are at a premium. Compared to the LMS scheme, RLS
typically offers faster convergence and improved estimation performance at the cost
of higher computational complexity. To enable these valuable tradeoffs in the
context of in-network processing, the ADMM framework of Section 2 is utilized here
to derive a decentralized (D-) RLS adaptive scheme that can be employed for
distributed localization and power spectrum estimation (see also [52, 54] for further
details on the algorithmic construction and convergence claims).


Consider the data setting and linear regression task in Section 4.1. The RLS
estimator for the unknown parameter $\mathbf{s}_0(t)$ minimizes the exponentially
weighted least-squares (EWLS) cost

$$\hat{\mathbf{s}}_{\mathrm{ewls}}(t) := \arg\min_{\mathbf{s}}\ \sum_{\tau=0}^{t}\sum_{i=1}^{n}\gamma^{t-\tau}\left[y_i(\tau)-\mathbf{h}_i^T(\tau)\mathbf{s}\right]^2 + \gamma^{t}\,\mathbf{s}^T\boldsymbol{\Phi}_0\mathbf{s} \tag{22}$$

where $\gamma \in (0,1]$ is a forgetting factor, while the positive definite matrix
$\boldsymbol{\Phi}_0$ is included for regularization. Note that in forming the EWLS
estimator at time $t$, the entire history of data
$\{y_i(\tau), \mathbf{h}_i(\tau)\}_{\tau=0}^{t}$ for $i = 1,\ldots,n$ is incorporated in
the online estimation process. Whenever $\gamma < 1$, past data are exponentially
discarded thus enabling tracking of nonstationary processes.

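Again, to decompose the cost function in (22), in which summands are coupled
through the global variable $\mathbf{s}$, we introduce auxiliary variables
$\{\mathbf{s}_i\}_{i=1}^{n}$ that represent local estimates per node $i$. These local
estimates are utilized to form the convex constrained and separable minimization
problem in (2), which can be solved using ADMM to yield the following decentralized
iterations (details in [52, 54])

$$\mathbf{v}_i(t) = \mathbf{v}_i(t-1) + c\sum_{j\in\mathcal{N}_i}[\mathbf{s}_i(t)-\mathbf{s}_j(t)] \tag{23}$$

$$\mathbf{s}_i(t+1) = \boldsymbol{\Phi}_i^{-1}(t+1)\boldsymbol{\psi}_i(t+1) - \frac{1}{2}\boldsymbol{\Phi}_i^{-1}(t+1)\mathbf{v}_i(t) \tag{24}$$

where $\boldsymbol{\Phi}_i(t+1) := \sum_{\tau=0}^{t+1}\gamma^{t+1-\tau}\mathbf{h}_i(\tau)\mathbf{h}_i^T(\tau) + n^{-1}\gamma^{t+1}\boldsymbol{\Phi}_0$ and

$$\boldsymbol{\Phi}_i^{-1}(t+1) = \gamma^{-1}\boldsymbol{\Phi}_i^{-1}(t) - \frac{\gamma^{-1}\boldsymbol{\Phi}_i^{-1}(t)\mathbf{h}_i(t+1)\mathbf{h}_i^T(t+1)\boldsymbol{\Phi}_i^{-1}(t)}{\gamma + \mathbf{h}_i^T(t+1)\boldsymbol{\Phi}_i^{-1}(t)\mathbf{h}_i(t+1)} \tag{25}$$

$$\boldsymbol{\psi}_i(t+1) := \sum_{\tau=0}^{t+1}\gamma^{t+1-\tau}\mathbf{h}_i(\tau)y_i(\tau) = \gamma\boldsymbol{\psi}_i(t) + \mathbf{h}_i(t+1)y_i(t+1). \tag{26}$$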


The D-RLS recursions (23) and (24) involve similar inter-node communication
exchanges as in D-LMS. It is recommended to initialize the matrix recursion with
$\boldsymbol{\Phi}_i^{-1}(0) = n\boldsymbol{\Phi}_0^{-1} := \delta\mathbf{I}_p$, where
$\delta > 0$ is chosen sufficiently large iTT). The local estimates in D-RLS converge
in the mean sense to the true $\mathbf{s}_0$ (time-invariant case), even when
information exchanges are imperfect. Closed-form expressions for the bounded
estimation MSE along with numerical tests and comparisons with the incremental RLS
p3| and diffusion RLS [13] algorithms can be found in [52].
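The rank-one recursion for the inverse correlation matrix is what keeps the per-node complexity of RLS-type schemes manageable. The sketch below checks it against direct inversion for p = 2. It assumes the standard exponentially weighted form in which the matrix is scaled by the forgetting factor and augmented by the outer product of the new regressor; the multiplier term is set to zero (a single, non-cooperative node), and all numeric values and helper names are illustrative.

```python
import random

def inv2(m):
    """Inverse of a 2x2 matrix stored as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def matvec(m, x):
    return [m[0][0] * x[0] + m[0][1] * x[1], m[1][0] * x[0] + m[1][1] * x[1]]

def rls_inverse_update(phi_inv, h, gamma):
    """Rank-one update of the inverse when Phi <- gamma * Phi + h h^T."""
    g = matvec(phi_inv, h)                       # Phi^{-1}(t) h(t+1)
    denom = gamma + h[0] * g[0] + h[1] * g[1]    # gamma + h^T Phi^{-1} h
    return [[(phi_inv[r][c] - g[r] * g[c] / denom) / gamma for c in range(2)]
            for r in range(2)]

rng = random.Random(0)
gamma, delta = 0.95, 100.0
s0 = [1.0, -2.0]
phi = [[1.0 / delta, 0.0], [0.0, 1.0 / delta]]   # so that Phi^{-1}(0) = delta * I
phi_inv = [[delta, 0.0], [0.0, delta]]
psi = [0.0, 0.0]
for _ in range(50):
    h = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    y = h[0] * s0[0] + h[1] * s0[1]              # noise-free observation
    phi = [[gamma * phi[r][c] + h[r] * h[c] for c in range(2)] for r in range(2)]
    phi_inv = rls_inverse_update(phi_inv, h, gamma)
    psi = [gamma * psi[k] + h[k] * y for k in range(2)]

direct = inv2(phi)
gap = max(abs(phi_inv[r][c] - direct[r][c]) for r in range(2) for c in range(2))
s_hat = matvec(phi_inv, psi)                     # local estimate with zero multiplier
```

A large initial value of the inverse corresponds to a weak regularizer, whose influence decays geometrically with the forgetting factor, so `s_hat` ends up close to `s0` after a few dozen samples.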

Decentralized spectrum sensing using WSNs. A WSN application where the need for
linear regression arises is spectrum estimation for the purpose of environmental
monitoring. Suppose sensors comprising a WSN deployed over some area of interest
observe a narrowband source to determine its spectral peaks. These peaks can reveal
hidden periodicities due to e.g., a natural heat or seismic source. The source of
interest propagates through multi-path channels and is contaminated with additive
noise present at the sensors. The unknown source-sensor channels may introduce deep
fades at the frequency band occupied by the source. Thus, having each sensor
operate on its own may lead to faulty assessments. The available spatial diversity
can be exploited to improve spectral estimates only via sensor collaboration, as in
the decentralized estimation algorithms presented in this chapter.

Let $\theta(t)$ denote the evolution of the source signal in time, and suppose that
$\theta(t)$ can be modeled as an autoregressive (AR) process [76, p. 106]

$$\theta(t) = -\sum_{\tau=1}^{p}\alpha_\tau\,\theta(t-\tau) + w(t)$$

where $p$ is the order of the AR process, while $\{\alpha_\tau\}$ are the AR
coefficients and $w(t)$ denotes driving white noise. The source propagates to sensor
$i$ via a channel modeled as an FIR filter
$C_i(z) = \sum_{l=0}^{L_i-1} c_{il}z^{-l}$, of unknown order $L_i$ and tap coefficients
$\{c_{il}\}$, and is contaminated with additive sensing noise $\bar{\epsilon}_i(t)$ to
yield the observation

$$y_i(t) = \sum_{l=0}^{L_i-1} c_{il}\,\theta(t-l) + \bar{\epsilon}_i(t).$$


Fig. 6 D-LMS in a power spectrum estimation task. (left) The true narrowband spectrum is
compared to the estimated PSD, obtained after the WSN runs the D-LMS and (non-cooperative)
L-LMS algorithms. The reconstruction results correspond to a sensor whose multipath channel
from the source introduces a null at $\omega = \pi/2 = 1.57$. (right) Global MSE evolution
(network learning curve) for the D-LMS and D-RLS algorithms.

Since $y_i(t)$ is an autoregressive moving average (ARMA) process, then [76]

$$y_i(t) = -\sum_{\tau=1}^{p}\alpha_\tau\,y_i(t-\tau) + \sum_{\tau'=1}^{m}\beta_{\tau'}\,\eta_i(t-\tau') \tag{27}$$

where the MA coefficients $\{\beta_{\tau'}\}$ and the variance of the white noise
process $\eta_i(t)$ depend on $\{c_{il}\}$, $\{\alpha_\tau\}$ and the variance of the
noise terms $w(t)$ and $\bar{\epsilon}_i(t)$. For the purpose of determining spectral
peaks, the MA term in (27) can be treated as observation noise, i.e.,
$\epsilon_i(t) := \sum_{\tau'=1}^{m}\beta_{\tau'}\eta_i(t-\tau')$. This is very
important since this way sensors do not have to know the source-sensor channel
coefficients or the noise variances. Accordingly, the spectral content of the source
can be estimated provided sensors estimate the coefficients $\{\alpha_\tau\}$. To this
end, let $\mathbf{s}_0 := [\alpha_1 \ldots \alpha_p]^T$ be the unknown parameter of
interest. From (27) the regression vectors are given as
$\mathbf{h}_i(t) = [-y_i(t-1)\ \cdots\ -y_i(t-p)]^T$, and can be acquired directly from
the sensor measurements without the need of training/estimation.
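To make the regression structure concrete, the sketch below builds the regressors from past observations and evaluates the AR spectrum implied by a coefficient vector. For simplicity it assumes an ideal channel and no sensing noise, so the sensor observes the AR source directly; the AR(2) coefficients are chosen (hypothetically) so that the spectral peak sits at a frequency of pi/2, and the function names are illustrative.

```python
import cmath
import math
import random

def ar_series(alpha, length, rng):
    """y(t) = -sum_tau alpha_tau * y(t - tau) + w(t), zero initial conditions."""
    p = len(alpha)
    y, w = [], []
    for t in range(length):
        wt = rng.gauss(0.0, 1.0)
        past = sum(alpha[tau - 1] * y[t - tau]
                   for tau in range(1, p + 1) if t - tau >= 0)
        y.append(-past + wt)
        w.append(wt)
    return y, w

def regressor(y, t, p):
    """h(t) = [-y(t-1), ..., -y(t-p)]^T, with zeros before the first sample."""
    return [-y[t - tau] if t - tau >= 0 else 0.0 for tau in range(1, p + 1)]

def ar_psd(alpha, omega, noise_var=1.0):
    """AR power spectral density: noise_var / |1 + sum_k alpha_k e^{-j omega k}|^2."""
    denom = 1.0 + sum(a * cmath.exp(-1j * omega * (k + 1))
                      for k, a in enumerate(alpha))
    return noise_var / abs(denom) ** 2

alpha = [0.0, 0.81]                      # poles at +/- 0.9j, peak at omega = pi/2
p = len(alpha)
y, w = ar_series(alpha, 200, random.Random(0))

# Linear regression form: y(t) = h(t)^T s0 + noise, with s0 = alpha.
residuals = [y[t] - sum(hk * ak for hk, ak in zip(regressor(y, t, p), alpha)) - w[t]
             for t in range(len(y))]

grid = [math.pi * k / 200 for k in range(201)]
peak = max(grid, key=lambda om: ar_psd(alpha, om))
```

Once a node holds an estimate of the AR coefficients, for instance from LMS- or RLS-type recursions, plugging it into `ar_psd` yields the estimated spectrum whose peak location is of interest.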

Performance of the decentralized adaptive algorithms described so far is illustrated
next, when applied to the aforementioned power spectrum estimation task.
For the numerical experiments, an ad hoc WSN with n = 80 sensors is simulated as a
realization of a random geometric graph. The source-sensor channels corresponding
to a few of the sensors are set so that they have a null at the frequency where the
AR source has a peak, namely at $\omega = \pi/2$. Fig. 6 (left) depicts the actual
power spectral density (PSD) of the source as well as the estimated PSDs for one of
the sensors affected by a bad channel. To form the desired estimates in a distributed
fashion, the WSN runs the local (L-) LMS and the D-LMS algorithm outlined in Section
4.1. The L-LMS is a non-cooperative scheme since each sensor, say the $i$th,
independently runs an LMS adaptive filter fed by its local data
$\{y_i(t), \mathbf{h}_i(t)\}$ only. The experiment involving D-LMS is performed under
ideal and noisy inter-sensor links. Clearly, even in the presence of communication
noise D-LMS exploits the spatial diversity available and allows all sensors to
estimate accurately the actual spectral peak, whereas L-LMS leads the problematic
sensors to misleading estimates.

For the same setup, Fig. 6 (right) shows the global learning curve evolution
$\mathrm{MSE}(t) = (1/n)\sum_{i=1}^{n}|y_i(t) - \mathbf{h}_i^T(t)\mathbf{s}_i(t-1)|^2$.
The D-LMS and the D-RLS algorithms are compared under ideal communication links. It
is apparent that D-RLS achieves improved performance both in terms of convergence
rate and steady-state MSE. As discussed in Section 4.2, this comes at the price of
increased computational complexity per sensor, while the communication costs
incurred are identical.


4.3 Decentralized Model-based Tracking 


The decentralized adaptive schemes in Secs. 4.1 and 4.2 are suitable for tracking
slowly time-varying signals in settings where no statistical models are available.
In certain cases, such as target tracking, state evolution models can be derived and
employed by exploiting the physics of the problem. The availability of such models
paves the way for improved state tracking via Kalman filtering/smoothing
techniques. Model-based decentralized Kalman filtering/smoothing as well as particle
filtering schemes for multi-node networks are briefly outlined here.

Initial attempts to distribute the centralized KF recursions (see [59] and
references in [68]) rely on consensus-averaging [83]. The idea is to estimate across
nodes those sufficient statistics (that are expressible in terms of network-wide
averages) required to form the corrected state and corresponding corrected state
error covariance matrix. Clearly, there is an inherent delay in obtaining these
estimates, confining the operation of such schemes only to applications with
slow-varying state vectors $\mathbf{s}_0(t)$, and/or fast communications needed to
complete multiple consensus iterations within the time interval separating the
acquisition of consecutive measurements $y_i(t)$ and $y_i(t+1)$. Other issues that
may lead to instability in existing decentralized KF approaches are detailed in | |M) .

Instead of filtering, the delay incurred by those inner-loop consensus iterations
motivated the consideration of fixed-lag decentralized Kalman smoothing (KS)
in [68]. Matching consensus iterations with those time instants of data acquisition,
fixed-lag smoothers allow sensors to form local MMSE optimal smoothed estimates,
which take advantage of all acquired measurements within the “waiting period.” The
ADMM-enabled decentralized KS in [68] also overcomes the noise-related limitations
of consensus-averaging algorithms | [M| . In the presence of communication
noise, these estimates converge in the mean sense, while their noise-induced
variance remains bounded. This noise resiliency allows sensors to exchange quantized
data, further lowering communication cost. For a tutorial treatment of decentralized
Kalman filtering approaches using WSNs (including the decentralized ADMM-based KS of
[68] and strategies to reduce the communication cost of state estimation
problems), the interested reader is referred to | |6^ . These reduced-cost strategies
exploit the redundancy in information provided by individual observations collected at
different sensors, different observations collected at different sensors, and different
observations acquired at the same sensor.

On a related note, a collaborative algorithm is developed in GZ) to estimate
the channel gains of wireless links in a geographical area. Kriged Kalman filtering
(KKF) [64], which is a tool with widely appreciated merits in spatial statistics
and geosciences, is adopted and implemented in a decentralized fashion leveraging
the ADMM framework described here. The distributed KKF algorithm requires only
local message passing to track the time-variant so-termed “shadowing field” using a
network of radiometers, yet it provides a global view of the radio frequency (RF)
environment through consensus iterations; see also Section 5.3 for further
elaboration on spectrum sensing carried out via wireless cognitive radio networks.

To wrap up the discussion, consider a network of collaborating agents (e.g.,
robots) equipped with wireless sensors measuring distance and/or bearing from a
target that they wish to track. Even if state models are available, the
nonlinearities present in these measurements prevent sensors from employing the
clairvoyant (linear) Kalman tracker discussed so far. In response to these
challenges, p7| develops a set-membership constrained particle filter (PF) approach
that: (i) exhibits performance comparable to the centralized PF; (ii) requires only
communication of particle weights among neighboring sensors; and (iii) can afford
both consensus-based and incremental averaging implementations. Affordable
inter-sensor communications are enabled through a novel distributed adaptation
scheme, which considerably reduces the number of particles needed to achieve a given
performance. The interested reader is referred to | [M| for a recent tutorial account
of decentralized PF in multi-agent networks.


5 Decentralized Sparsity-regularized Rank Minimization 

Modern network data sets typically involve a large number of attributes. This fact 
motivates predictive models offering a sparse, broadly meaning parsimonious, rep¬ 
resentation in terms of a few attributes. Such low-dimensional models facilitate 
interpretability and enhanced predictive performance. In this context, this section 
deals with ADMM-based decentralized algorithms for sparsity-regularized rank 
minimization. It is argued that such algorithms are key to unveiling Internet traf¬ 
fic anomalies given ubiquitous link-load measurements. Moreover, the notion of RF 
cartography is subsequently introduced to exemplify the development of a paradigm 
infrastructure for situational awareness at the physical layer of wireless cognitive ra¬ 
dio (CR) networks. A (subsumed) decentralized sparse linear regression algorithm 
is outlined to accomplish the aforementioned cartography task. 


5.1 Network Anomaly Detection Via Sparsity and Low Rank 

Consider a backbone IP network, whose abstraction is a graph with n nodes (routers)
and L physical links. The operational goal of the network is to transport a set of F
origin-destination (OD) traffic flows associated with specific OD (ingress-egress
router) pairs. Let $x_{l,t}$ denote the traffic volume (in bytes or packets) passing
through link $l \in \{1,\ldots,L\}$ over a fixed time interval $(t, t+\Delta t)$. Link
counts across the entire network are collected in the vector
$\mathbf{x}_t \in \mathbb{R}^{L}$, e.g., using the ubiquitous SNMP protocol.
Single-path routing is adopted here, meaning a given flow's traffic is carried
through multiple links connecting the corresponding source-destination pair along a
single path. Accordingly, over a discrete time horizon $t \in [1,T]$ the measured
link counts $\mathbf{X} := [x_{l,t}] \in \mathbb{R}^{L\times T}$ and (unobservable) OD
flow traffic matrix $\mathbf{Z} := [z_{f,t}] \in \mathbb{R}^{F\times T}$ are related
through $\mathbf{X} = \mathbf{RZ}$ [41], where the so-termed routing matrix
$\mathbf{R} := [r_{l,f}] \in \{0,1\}^{L\times F}$ is such that $r_{l,f} = 1$ if link $l$
carries the flow $f$, and zero otherwise. The routing matrix is ‘wide,’ as for
backbone networks the number of OD flows is much larger than the number of physical
links ($F \gg L$). A cardinal property of the traffic matrix is noteworthy. Common
temporal patterns across OD traffic flows, in addition to their almost periodic
behavior, render most rows (respectively columns) of the traffic matrix linearly
dependent, and thus Z typically has low rank. This intuitive property has been
extensively validated with real network data; see Fig. 7 and e.g., [41].



Fig. 7 Volumes of 6 representative (out of 121 total) OD flows, taken from the operation of 
Internet-2 during a seven-day period. Temporal periodicities and correlations across flows are 
apparent. As expected, in this case Z can be well approximated by a low-rank matrix, since its 
normalized singular values decay rapidly to zero. 


It is not uncommon for some of the OD flow rates to experience unexpected
abrupt changes. These so-termed traffic volume anomalies are typically due to
(unintentional) network equipment misconfiguration or outright failure, unforeseen
behaviors following routing policy modifications, or cyber attacks (e.g., DoS
attacks) which aim at compromising the services offered by the network [41, 53, 86].
Let $a_{f,t}$ denote the unknown amount of anomalous traffic in flow $f$ at time $t$,
which one wishes to estimate. Explicitly accounting for the presence of anomalous
flows, the measured traffic carried by link $l$ is then given by
$y_{l,t} = \sum_{f} r_{l,f}(z_{f,t} + a_{f,t}) + \epsilon_{l,t}$, $t = 1,\ldots,T$,
where the noise variables $\epsilon_{l,t}$ capture measurement errors and unmodeled
dynamics. Traffic volume anomalies are (unsigned) sudden changes in the traffic of
OD flows, and as such their effect can span multiple links in the network. A key
difficulty in unveiling anomalies from link-level measurements only is that
oftentimes, clearly discernible anomalous spikes in the flow traffic can be masked
through “destructive interference” of the superimposed OD flows [41]. An additional
challenge stems from missing link-level measurements $y_{l,t}$, an unavoidable
operational reality affecting most traffic engineering tasks that rely on (indirect)
measurement of traffic matrices | [^ . To model missing link measurements, collect
the tuples $(l,t)$ associated with the available observations $y_{l,t}$ in the set
$\Omega \subseteq [1,2,\ldots,L]\times[1,2,\ldots,T]$. Introducing the matrices
$\mathbf{Y} := [y_{l,t}]$, $\mathbf{E} := [\epsilon_{l,t}] \in \mathbb{R}^{L\times T}$,
and $\mathbf{A} := [a_{f,t}] \in \mathbb{R}^{F\times T}$, the (possibly incomplete) set
of link-traffic measurements can be expressed in compact matrix form as

$$\mathcal{P}_\Omega(\mathbf{Y}) = \mathcal{P}_\Omega(\mathbf{X} + \mathbf{RA} + \mathbf{E}) \tag{28}$$

where the sampling operator $\mathcal{P}_\Omega(\cdot)$ sets the entries of its matrix
argument not in $\Omega$ to zero, and keeps the rest unchanged. Since the objective
here is not to estimate the OD flow traffic matrix Z, (28) is expressed in terms of
the nominal (anomaly-free) link-level traffic rates $\mathbf{X} := \mathbf{RZ}$, which
inherits the low-rank property of Z. Anomalies in A are expected to occur
sporadically over time, and last for a short time relative to the (possibly long)
measurement interval $[1,T]$. In addition, only a small fraction of the flows is
supposed to be anomalous at any given time instant. This renders the anomaly matrix
A sparse across rows (flows) and columns (time).
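A toy instance of this measurement model may help fix ideas. The sketch below uses made-up numbers (two links, three flows, four time slots, noise-free counts) and hypothetical helper names. It shows how a single anomalous flow shows up on every link that carries it, and how the sampling operator zeroes out a missing entry.

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def madd(A, B):
    """Entrywise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sample(M, omega):
    """Sampling operator: keep entries (l, t) in omega, set the rest to zero."""
    return [[M[l][t] if (l, t) in omega else 0.0 for t in range(len(M[0]))]
            for l in range(len(M))]

# Routing matrix: entry (l, f) is 1 if link l carries flow f.
R = [[1, 1, 0],     # link 0 carries flows 0 and 1
     [0, 1, 1]]     # link 1 carries flows 1 and 2
# Nominal OD flow traffic (constant in time, hence rank one).
Z = [[5.0, 5.0, 5.0, 5.0],
     [2.0, 2.0, 2.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
# Sparse anomaly matrix: flow 1 is anomalous at time slot 2 only.
A = [[0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 9.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]

X = matmul(R, Z)                  # nominal link-level traffic
Y = madd(X, matmul(R, A))         # noise-free link counts with anomalies
omega = {(l, t) for l in range(2) for t in range(4)} - {(1, 3)}   # one entry missing
Y_obs = sample(Y, omega)
```

Because flow 1 traverses both links, its anomaly at time slot 2 raises the counts of both rows of `Y`, which illustrates why anomalies in a few flows can span multiple links.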

Recently, a natural estimator leveraging the low-rank property of X and the
sparsity of A was put forth in pS] , which can be found at the crossroads of
compressive sampling and timely low-rank plus sparse matrix decompositions [11, 14].
The idea is to fit the incomplete data $\mathcal{P}_\Omega(\mathbf{Y})$ to the model
$\mathbf{X} + \mathbf{RA}$ [cf. (28)] in the LS error sense, as well as minimize the
rank of X, and the number of nonzero entries of A measured by its $\ell_0$-(pseudo)
norm. Unfortunately, albeit natural, both rank and $\ell_0$-norm criteria are in
general NP-hard to optimize. Typically, the nuclear norm
$\|\mathbf{X}\|_* := \sum_k \sigma_k(\mathbf{X})$ ($\sigma_k(\mathbf{X})$ denotes the
k-th singular value of X) and the $\ell_1$-norm $\|\mathbf{A}\|_1$ are adopted as
surrogates [12, 28], since they are the closest convex approximants to rank(X) and
$\|\mathbf{A}\|_0$, respectively. Accordingly, one solves

$$\min_{\{\mathbf{X},\mathbf{A}\}}\ \frac{1}{2}\|\mathcal{P}_\Omega(\mathbf{Y}-\mathbf{X}-\mathbf{RA})\|_F^2 + \lambda_*\|\mathbf{X}\|_* + \lambda_1\|\mathbf{A}\|_1 \tag{29}$$


where $\lambda_*, \lambda_1 \geq 0$ are rank- and sparsity-controlling parameters.
While a non-smooth optimization problem, (29) is appealing because it is convex. An
efficient accelerated proximal gradient algorithm with quantifiable iteration
complexity was developed to unveil network anomalies [50]. Interestingly, (29) also
offers a cleansed estimate of the link-level traffic $\hat{\mathbf{X}}$, that could be
subsequently utilized for network tomography tasks. In addition, (29) jointly
exploits the spatio-temporal correlations in link traffic as well as the sparsity of
anomalies, through an optimal single-shot estimation-detection procedure that turns
out to outperform the algorithms in HD and [86] (the latter decouple the estimation
and detection steps); see Fig. 8.

Fig. 8 Unveiling anomalies from Internet-2 data. (Left) ROC curve comparison between (29) and
the PCA methods in E® , for different values of the rank(Z). Leveraging sparsity and low rank
jointly leads to improved performance. (Right) In red, the estimated anomaly map
$\hat{\mathbf{A}}$ obtained via (29) superimposed to the “true” anomalies shown in blue [49].
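Proximal-type solvers for costs with an entrywise absolute-value penalty rely on soft-thresholding, which is the proximal operator of that penalty. The sketch below shows only this one building block on a small matrix. It is not the accelerated algorithm cited above, and the threshold and input values are illustrative.

```python
def soft_threshold(M, tau):
    """Entrywise proximal operator of tau * ||.||_1: shrink toward zero by tau."""
    def shrink(a):
        if a > tau:
            return a - tau
        if a < -tau:
            return a + tau
        return 0.0
    return [[shrink(a) for a in row] for row in M]

def l1_norm(M):
    """Sum of absolute values of all entries."""
    return sum(abs(a) for row in M for a in row)

# Hypothetical intermediate iterate of the anomaly matrix.
G = [[0.2, -3.0, 0.0],
     [5.0, -0.4, 1.0]]
A_new = soft_threshold(G, 1.0)
```

Entries whose magnitude does not exceed the threshold are set exactly to zero, which is how the sparsity-controlling parameter translates into a sparse anomaly map; larger thresholds remove more entries.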


5.2 In-network Traffic Anomaly Detection 

Implementing (29) presumes that network nodes continuously communicate their
link traffic measurements to a central monitoring station, which uses their
aggregation in $\mathcal{P}_\Omega(\mathbf{Y})$ to unveil anomalies. While for the
most part this is the prevailing operational paradigm adopted in current networks,
it is prudent to reflect on the limitations associated with this architecture. For
instance, fusing all this information may entail excessive protocol overheads.
Moreover, minimizing the exchanges of raw measurements may be desirable to reduce
unavoidable communication errors that translate to missing data. Solving (29)
centrally raises robustness concerns as well, since the central monitoring station
represents a single point of failure.

These reasons prompt one to develop fully-decentralized iterative algorithms for
unveiling traffic anomalies, and thus embed network anomaly detection functionality
to the routers. As in the decentralized schemes discussed earlier in the chapter, per
iteration node $i$ carries out simple computational tasks locally, relying on its own
link count measurements (a submatrix $Y_i$ within $Y := [Y_1^\top, \ldots, Y_n^\top]^\top$ corresponding
to router $i$'s links). Subsequently, local estimates are refined after exchanging messages
only with directly connected neighbors, which facilitates percolation of local
information to the whole network. The end goal is for network nodes to consent on a
global map of network anomalies $A$, and attain (or at least come close to) the estimation
performance of the centralized counterpart (29), which has all data $\mathcal{P}_\Omega(Y)$ available.

Problem (29) is not amenable to distributed implementation because of the non-separable
nuclear norm present in the cost function. If an upper bound rank$(\hat{X}) \le \rho$
is a priori available [recall $\hat{X}$ is the estimated link-level traffic obtained via (29)], the
search space of (29) is effectively reduced, and one can factorize the decision variable
as $X = PQ^\top$, where $P$ and $Q$ are $L \times \rho$ and $T \times \rho$ matrices, respectively. Again,
it is possible to interpret the columns of $X$ (viewed as points in $\mathbb{R}^L$) as belonging
to a low-rank nominal subspace, spanned by the columns of $P$. The rows of $Q$ are
thus the projections of the columns of $X$ onto the traffic subspace. Next, consider
the following alternative characterization of the nuclear norm (see e.g. [?])

$$ \|X\|_* := \min_{\{P,Q\}} \; \frac{1}{2}\left(\|P\|_F^2 + \|Q\|_F^2\right), \quad \text{s. to } X = PQ^\top \tag{30} $$


where the optimization is over all possible bilinear factorizations of $X$, so that the
number of columns $\rho$ of $P$ and $Q$ is also a variable. Leveraging (30), the following
reformulation of (29) provides an important first step towards obtaining a decentralized
algorithm for anomaly identification


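$$ \min_{\{P,Q,A\}} \; \sum_{i=1}^{n} \left[ \|\mathcal{P}_{\Omega_i}(Y_i - P_i Q^\top - R_i A)\|_F^2 + \frac{\lambda_*}{2n}\left(n\|P_i\|_F^2 + \|Q\|_F^2\right) + \frac{\lambda_1}{n}\|A\|_1 \right] \tag{31} $$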


which is non-convex due to the bilinear terms $P_i Q^\top$, and where $R := [R_1^\top, \ldots, R_n^\top]^\top$
is partitioned into local routing tables available per router $i$. Adopting the separable
Frobenius-norm regularization in (31) comes with no loss of optimality relative to
(29), provided rank$(\hat{X}) \le \rho$. By finding the global minimum of (31) [which could
entail considerably fewer variables than (29)], one can recover the optimal solution
of (29). But since (31) is non-convex, it may have stationary points which need not
be globally optimal. As asserted in [48, Prop. 1] however, if a stationary point
$\{\bar{P}, \bar{Q}, \bar{A}\}$ of (31) satisfies $\|\mathcal{P}_\Omega(Y - \bar{P}\bar{Q}^\top - R\bar{A})\| \le \lambda_*$, then $\{\hat{X} := \bar{P}\bar{Q}^\top, \hat{A} := \bar{A}\}$
is the globally optimal solution of (29).
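
As a sanity check on the characterization (30), consider the rank-one case $X = uv^\top$, for which $\|X\|_* = \|u\|\,\|v\|$. The sketch below evaluates the cost $\frac{1}{2}(\|P\|_F^2 + \|Q\|_F^2)$ over the one-parameter family $P = \alpha u$, $Q = v/\alpha$. The vectors and helper names are illustrative assumptions, not taken from the chapter, and the sketch does not search over all bilinear factorizations; it only shows that the balanced scaling attains $\|u\|\,\|v\|$ while an unbalanced one is strictly larger.

```python
import math

def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(t * t for t in x))

def factor_cost(u, v, alpha):
    """Cost 0.5*(||P||_F^2 + ||Q||_F^2) for P = alpha*u and Q = v/alpha,
    so that P Q^T = u v^T for every alpha > 0."""
    return 0.5 * ((alpha * norm(u)) ** 2 + (norm(v) / alpha) ** 2)

# Hypothetical rank-one matrix X = u v^T.
u = [3.0, 4.0]        # ||u|| = 5
v = [1.0, 2.0, 2.0]   # ||v|| = 3
nuclear_norm = norm(u) * norm(v)  # the single nonzero singular value of u v^T

# Balanced scaling: alpha^2 = ||v|| / ||u|| equalizes the two Frobenius terms.
balanced_alpha = math.sqrt(norm(v) / norm(u))
balanced_cost = factor_cost(u, v, balanced_alpha)

# Unbalanced scaling (alpha = 1) gives a valid but suboptimal factorization.
unbalanced_cost = factor_cost(u, v, 1.0)
```

By the arithmetic-geometric mean inequality, every member of this family costs at least $\|u\|\,\|v\|$, which is why the Frobenius-norm regularizer in (31) can stand in for the nuclear norm.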

To decompose the cost in (31), in which summands inside the square brackets are
coupled through the global variables $\{Q, A\}$, one can proceed as done earlier in the
chapter and introduce auxiliary copies $\{Q_i, A_i\}_{i=1}^{n}$ representing local estimates of
$\{Q, A\}$, one per node $i$. These local copies along with consensus constraints yield
the decentralized estimator


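$$ \begin{aligned} \min_{\{P_i,Q_i,A_i\}} \; & \sum_{i=1}^{n} \left[ \|\mathcal{P}_{\Omega_i}(Y_i - P_i Q_i^\top - R_i A_i)\|_F^2 + \frac{\lambda_*}{2n}\left(n\|P_i\|_F^2 + \|Q_i\|_F^2\right) + \frac{\lambda_1}{n}\|A_i\|_1 \right] \\ \text{s. to } \; & Q_i = Q_j, \; A_i = A_j, \quad i = 1,\ldots,n, \; j \in \mathcal{N}_i \end{aligned} \tag{32} $$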


which follows the general separable form with consensus constraints introduced earlier
in the chapter, and is equivalent to (31) provided the network topology graph is
connected. Even though consensus is a fortiori imposed within neighborhoods, it
carries over to the entire (connected) network and local estimates agree on the global
solution of (31). Exploiting the separable structure of (32) using the ADMM, a general
framework for in-network sparsity-regularized rank minimization was put forth in [48].
In a nutshell, local tasks per iteration $k = 1, 2, \ldots$ entail solving small unconstrained
quadratic programs to refine the nominal subspace $P_i[k]$, in addition to soft-thresholding
operations to update the anomaly maps $A_i[k]$ per router. Routers exchange their
estimates $\{Q_i[k], A_i[k]\}$ only with directly connected neighbors per iteration. This
way the communication overhead remains affordable, regardless of the network size $n$.
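
To make the soft-thresholding step concrete, the following minimal sketch implements the entrywise operator $\mathrm{sign}(x)\max(|x| - \tau, 0)$, which is the proximal operator of $\tau\|\cdot\|_1$. The exact argument and threshold used in the per-router update of $A_i[k]$ are specified in [48] and are not reproduced here; the input matrix and the value of `tau` below are hypothetical.

```python
def soft_threshold(x, tau):
    """Scalar soft-thresholding: sign(x) * max(|x| - tau, 0)."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def soft_threshold_matrix(a, tau):
    """Apply soft-thresholding entrywise to a matrix stored as a list of rows."""
    return [[soft_threshold(x, tau) for x in row] for row in a]

# Hypothetical residual matrix: small entries are zeroed, large ones shrink by tau.
residual = [[3.0, -0.5], [0.2, -2.0]]
anomaly_map = soft_threshold_matrix(residual, 1.0)
```

Entries with magnitude below the threshold are set exactly to zero, which is the mechanism that produces sparse anomaly maps.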

When employed to solve non-convex problems such as (32), so far ADMM offers
no convergence guarantees. However, there is ample experimental evidence in
the literature that supports empirical convergence of ADMM, especially when the
non-convex problem at hand exhibits “favorable” structure [?]. For instance, (32) is
a linearly constrained bi-convex problem with potentially good convergence properties;
extensive numerical tests in [48] demonstrate that this is indeed the case.
While establishing convergence remains an open problem, one can still prove that
upon convergence the distributed iterations attain consensus and global optimality,
thus offering the desirable centralized performance guarantees [48].


5.3 RF Cartography Via Decentralized Sparse Linear Regression 


In the domain of spectrum sensing for CR networks, RF cartography amounts to 
constructing in a distributed fashion: i) global power spectral density (PSD) maps 
capturing the distribution of radiated power across space, time, and frequency; and 
ii) local channel gain (CG) maps offering the propagation medium per frequency 
from each node to any point in space [?]. These maps enable identification of
opportunistically available spectrum bands for re-use and handoff operation; as well
as localization, transmit-power estimation, and tracking of primary user activities.
While the focus here is on the construction of PSD maps, the interested reader is
referred to [39] for a tutorial treatment on CG cartography.

A cooperative approach to RF cartography was introduced in [?], which builds on a
basis expansion model of the PSD map $\Phi(x, f)$ across space $x$ and frequency
$f$. Spatially-distributed CRs collect smoothed periodogram samples of the received
signal at given sampling frequencies, based on which the unknown expansion coefficients
are determined. Introducing a virtual spatial grid of candidate source locations,
the estimation task can be cast as a linear LS problem with an augmented
vector of unknown parameters. Still, the problem complexity (or effective degrees
of freedom) can be controlled by capitalizing on two forms of sparsity: the first
one introduced by the narrow-band nature of transmit-PSDs relative to the broad
swaths of usable spectrum; and the second one emerging from sparsely located active
radios in the operational space (due to the grid artifact). Nonzero entries in the
parameter vector sought correspond to spatial location-frequency band pairs with
active transmissions. All in all, estimating the PSD map and locating
the active transmitters as a byproduct boils down to a variable selection problem.
This motivates employing the ADMM and the least-absolute shrinkage
and selection operator (Lasso) for decentralized sparse linear regression [48], [51], an
estimator subsumed by (29) when $X = 0_{L \times T}$, $T = 1$, and matrix $R$ has a specific
structure that depends on the chosen bases and the path-loss propagation model.

Sparse total LS variants are also available to cope with uncertainty in the regression
matrix, arising due to inaccurate channel estimation and grid-mismatch
effects [?]. Nonparametric spline-based PSD map estimators [?] have been also
shown effective in capturing general propagation characteristics including both
shadowing and fading; see also Fig. 9 for an actual PSD atlas spanning 14 frequency
sub-bands.




Fig. 9 Spline-based RF cartography from real wireless LAN data. (Left) Detailed floor plan 
schematic including the location of A = 166 sensing radios; (Right-bottom) original measure¬ 
ments spanning 14 frequency sub-bands; (Right-center) estimated maps over the surveyed area; 
and (Right-top) extrapolated maps. The proposed decentralized estimator is capable of recovering 
the 9 (out of 14 total) center frequencies that are being utilized for transmission. It accurately
recovers the power levels in the surveyed area with a smooth extrapolation to zones where there are
no measurements, and suggests possible locations for the transmitters.


6 Convergence Analysis 

In this section we analyze the convergence and assess the rate of convergence for
the decentralized ADMM algorithm outlined earlier in the chapter. We focus on the batch
learning setup, where the local cost functions are static.


6.1 Preliminaries 

Network model revisited and elements of algebraic graph theory. Recall the
network model briefly introduced earlier in the chapter, based on a connected graph
composed of a set of $n$ nodes (agents, vertices), and a set of $L$ edges (arcs, links).
Each edge $e = (i, j)$ represents an ordered pair indicating that node $i$ communicates
with node $j$. Communication is assumed bidirectional so that per edge
$e = (i, j)$, the edge $e' = (j, i)$ is also present. Nodes adjacent to $i$ are its neighbors
and belong to the (neighborhood) set $\mathcal{N}_i$. The cardinality of $\mathcal{N}_i$ equals the degree
$d_i$ of node $i$. Let $A_s \in \mathbb{R}^{Lp \times np}$ denote the block edge source matrix, where
the block $[A_s]_{e,i} = I_p \in \mathbb{R}^{p \times p}$ if the edge $e$ originates at node $i$, and is null otherwise.
Likewise, define the block edge destination matrix $A_d \in \mathbb{R}^{Lp \times np}$ where
the block $[A_d]_{e,j} = I_p \in \mathbb{R}^{p \times p}$ if the edge $e$ terminates at node $j$, and is null
otherwise. The so-termed extended oriented incidence matrix can be written as
$E_o = A_s - A_d$, and the unoriented incidence matrix as $E_u = A_s + A_d$. The extended
oriented (signed) Laplacian is then given by $\mathbf{L}_o = (1/2)E_o^\top E_o$, the unoriented
(unsigned) Laplacian by $\mathbf{L}_u = (1/2)E_u^\top E_u$, and the degree matrix $D = \mathrm{diag}(d_1, \ldots, d_n)$
is $D = (1/2)(\mathbf{L}_o + \mathbf{L}_u)$. With $\Gamma_u$ denoting the largest eigenvalue of $\mathbf{L}_u$, and $\gamma_o$ the
smallest nonzero eigenvalue of $\mathbf{L}_o$, basic results in algebraic graph theory establish
that both $\Gamma_u$ and $\gamma_o$ are measures of network connectedness.
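
The definitions above can be checked on a small example. The sketch below builds the edge source and destination matrices for the scalar case $p = 1$ on a hypothetical four-node graph, forms both Laplacians, and recovers the degree matrix from their average. The graph, the helper names, and the list-of-rows matrix storage are all assumptions made for illustration.

```python
def matmul_t(a, b):
    """Return a^T b for matrices stored as lists of rows."""
    rows, n, m = len(a), len(a[0]), len(b[0])
    return [[sum(a[e][i] * b[e][j] for e in range(rows)) for j in range(m)]
            for i in range(n)]

def graph_laplacians(n, links):
    """Oriented and unoriented Laplacians (p = 1) of an undirected graph.

    Each undirected link (i, j) contributes the two directed edges (i, j) and (j, i),
    matching the bidirectional communication model.
    """
    edges = list(links) + [(j, i) for (i, j) in links]
    # Row e of A_s (A_d) has a single 1 at the source (destination) node of edge e.
    a_s = [[1.0 if node == i else 0.0 for node in range(n)] for (i, _) in edges]
    a_d = [[1.0 if node == j else 0.0 for node in range(n)] for (_, j) in edges]
    e_o = [[s - d for s, d in zip(rs, rd)] for rs, rd in zip(a_s, a_d)]
    e_u = [[s + d for s, d in zip(rs, rd)] for rs, rd in zip(a_s, a_d)]
    lap_o = [[0.5 * x for x in row] for row in matmul_t(e_o, e_o)]
    lap_u = [[0.5 * x for x in row] for row in matmul_t(e_u, e_u)]
    return lap_o, lap_u

# Hypothetical graph: nodes 0..3 with four undirected links (L = 8 directed edges).
links = [(0, 1), (1, 2), (2, 3), (0, 2)]
lap_o, lap_u = graph_laplacians(4, links)

# Degree matrix recovered as D = (1/2)(L_o + L_u).
degree = [[0.5 * (lap_o[i][j] + lap_u[i][j]) for j in range(4)] for i in range(4)]
```

For this graph the diagonal of `degree` holds the node degrees 2, 2, 3 and 1, and every row of `lap_o` sums to zero, as expected of a signed Laplacian.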

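Compact learning problem representation. With reference to the decentralized
learning problem formulated earlier in the chapter, define $s := [s_1^\top \ldots s_n^\top]^\top \in \mathbb{R}^{np}$
concatenating all local estimates $s_i$, and $z := [z_1^\top \ldots z_L^\top]^\top \in \mathbb{R}^{Lp}$ concatenating
all auxiliary variables $z_e = z_i^j$. For notational convenience, introduce the aggregate
cost function $f : \mathbb{R}^{np} \to \mathbb{R}$ as $f(s) := \sum_{i=1}^{n} f_i(s_i; y_i)$. Using these definitions
along with the edge source and destination matrices, the learning problem can be
rewritten in compact matrix form as

$$ \min_{s} \; f(s), \quad \text{s. to } A_s s - z = 0, \; A_d s - z = 0. $$

Upon defining $A := [A_s^\top \; A_d^\top]^\top \in \mathbb{R}^{2Lp \times np}$ and $B := [-I_{Lp} \; -I_{Lp}]^\top$, this problem reduces to

$$ \min_{s} \; f(s), \quad \text{s. to } As + Bz = 0. $$

As before, consider Lagrange multipliers $\bar{v}_e = \bar{v}_i^j$ associated with the constraints
$s_i = z_i^j$, and $\tilde{v}_e = \tilde{v}_i^j$ associated with $s_j = z_i^j$. Next, define the supervectors
$\bar{v} := [\bar{v}_1^\top \ldots \bar{v}_L^\top]^\top \in \mathbb{R}^{Lp}$ and $\tilde{v} := [\tilde{v}_1^\top \ldots \tilde{v}_L^\top]^\top \in \mathbb{R}^{Lp}$, collecting those multipliers
associated with the constraints $A_s s - z = 0$ and $A_d s - z = 0$, respectively. Finally,
associate multipliers $v := [\bar{v}^\top \; \tilde{v}^\top]^\top \in \mathbb{R}^{2Lp}$ with the constraint
$As + Bz = 0$. This way, the augmented Lagrangian function is

$$ \mathcal{L}_c(s, z, v) = f(s) + v^\top (As + Bz) + \frac{c}{2}\|As + Bz\|^2 $$

where $c > 0$ is a positive constant, as in the augmented Lagrangian introduced earlier in the chapter.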

Assumptions and scope of the convergence analysis. In the convergence analysis,
we assume that the compact learning problem stated above has at least a pair of
primal-dual solutions. In addition, we make the following assumptions on the local
cost functions $f_i$.

Assumption 1. The local cost functions $f_i$ are closed, proper, and convex.

Assumption 2. The local cost functions $f_i$ have Lipschitz gradients, meaning there
exists a positive constant $M_f > 0$ such that for any node $i$ and for any pair of points
$s_a$ and $s_b$, it holds that $\|\nabla f_i(s_a) - \nabla f_i(s_b)\| \le M_f \|s_a - s_b\|$.

Assumption 3. The local cost functions $f_i$ are strongly convex; that is, there exists
a positive constant $m_f > 0$ such that for any node $i$ and for any pair of points $s_a$ and
$s_b$, it holds that $(s_a - s_b)^\top (\nabla f_i(s_a) - \nabla f_i(s_b)) \ge m_f \|s_a - s_b\|^2$.

Assumption 1 implies that the aggregate function $f(s) := \sum_{i=1}^{n} f_i(s_i; y_i)$ is closed,
proper, and convex. Assumption 2 ensures that the aggregate cost $f$ has Lipschitz
gradients with constant $M_f$; thus, for any pair of points $s_a$ and $s_b$ it holds that

$$ \|\nabla f(s_a) - \nabla f(s_b)\| \le M_f \|s_a - s_b\|. \tag{33} $$

Assumption 3 guarantees that the aggregate cost $f$ is strongly convex with constant
$m_f$; hence, for any pair of points $s_a$ and $s_b$ it holds that

$$ (s_a - s_b)^\top (\nabla f(s_a) - \nabla f(s_b)) \ge m_f \|s_a - s_b\|^2. \tag{34} $$

Observe that Assumptions 2 and 3 imply that the local cost functions $f_i$ and the
aggregate cost function $f$ are differentiable. Assumption 1 is sufficient to prove
global convergence of the decentralized ADMM algorithm. To establish linear rate
of convergence however, one further needs Assumptions 2 and 3.


6.2 Convergence 


In the sequel, we investigate convergence of the primal variables $s(k)$ and $z(k)$ as
well as the dual variable $v(k)$, to their respective optimal values. At an optimal
primal solution pair $(s^*, z^*)$, consensus is attained and $s^*$ is formed by $n$ stacked
copies of $\tilde{s}^*$, while $z^*$ also comprises $L$ stacked copies of $\tilde{s}^*$, where $\tilde{s}^* = \hat{s}$ is an
optimal solution of the original learning problem. If the local cost functions are not
strongly convex, then there may exist multiple optimal primal solutions; instead, if the
local cost functions are strongly convex (i.e., Assumption 3 holds), the optimal primal
solution is unique.

For an optimal primal solution pair $(s^*, z^*)$, there exist multiple optimal Lagrange
multipliers $v^* := [(\bar{v}^*)^\top \; (\tilde{v}^*)^\top]^\top$, where $\tilde{v}^* = -\bar{v}^*$ [?]. In the following convergence
analysis, we show that $v(k)$ converges to one of such optimal dual solutions
$v^*$. In establishing linear rate of convergence, we require that the dual variable is
initialized so that $\bar{v}(0)$ lies in the column space of $E_o$; and consider its convergence to a
unique dual solution $v^* := [(\bar{v}^*)^\top \; (\tilde{v}^*)^\top]^\top$ in which $\bar{v}^*$ and $\tilde{v}^*$ also lie in the column
space of $E_o$. Existence and uniqueness of such a $v^*$ are also proved in [42], [73].
Throughout the analysis, define


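$$ u := \begin{bmatrix} s \\ \bar{v} \end{bmatrix}, \qquad H := \begin{bmatrix} \frac{c}{2}\mathbf{L}_u & 0 \\ 0 & \frac{1}{c} I_{Lp} \end{bmatrix}. $$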

We consider convergence of $u(k)$ to its optimum $u^* := [(s^*)^\top \; (\bar{v}^*)^\top]^\top$, where $(s^*, \bar{v}^*)$
is an optimal primal-dual pair. The analysis is based on several contraction inequalities,
in which the distance is measured in the (pseudo) Euclidean norm with respect
to the positive semi-definite matrix $H$.

To situate the forthcoming results in context, notice that convergence of the centralized
ADMM for constrained optimization problems has been proved in e.g., [?],
and its ergodic $O(1/k)$ rate of convergence is established in [34]. For non-ergodic
convergence, [35] proves an $O(1/k)$ rate, and [20] improves the rate to $o(1/k)$. Observe
that in [20], [35] the rate refers to the speed at which the difference between two
successive primal-dual iterates vanishes, which differs from the speed at which the
primal-dual iterates converge to their optima. Convergence of the decentralized
ADMM is presented next in the sense that the primal-dual iterates converge to their
optima. The analysis proceeds in four steps:

S1. Show that $\|u(k) - u^*\|_H^2$ is monotonic, namely, for all times $k \ge 0$ it holds that

$$ \|u(k+1) - u^*\|_H^2 \le \|u(k) - u^*\|_H^2 - \|u(k+1) - u(k)\|_H^2. \tag{35} $$

S2. Show that $\|u(k+1) - u(k)\|_H^2$ is monotonically non-increasing, that is

$$ \|u(k+2) - u(k+1)\|_H^2 \le \|u(k+1) - u(k)\|_H^2. \tag{36} $$

S3. Derive an $O(1/k)$ rate in a non-ergodic sense based on (35) and (36), i.e.,

$$ \|u(k+1) - u(k)\|_H^2 \le \frac{1}{k+1}\|u(0) - u^*\|_H^2. \tag{37} $$

S4. Prove that $u(k) := [s(k)^\top \; \bar{v}(k)^\top]^\top$ converges to a pair of optimal primal and dual
solutions of the compact learning problem of Section 6.1.
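
For completeness, one way to see how the rate in S3 follows from S1 and S2 is the standard telescoping argument sketched below. It assumes only that (35) and (36) hold for all $k \ge 0$; the summation index $t$ is introduced here for illustration.

```latex
\begin{align*}
\sum_{t=0}^{k} \|u(t+1)-u(t)\|_H^2
  &\le \sum_{t=0}^{k} \left( \|u(t)-u^*\|_H^2 - \|u(t+1)-u^*\|_H^2 \right)
   && \text{by (35)} \\
  &= \|u(0)-u^*\|_H^2 - \|u(k+1)-u^*\|_H^2
   \;\le\; \|u(0)-u^*\|_H^2 , \\
(k+1)\,\|u(k+1)-u(k)\|_H^2
  &\le \sum_{t=0}^{k} \|u(t+1)-u(t)\|_H^2
   && \text{by (36)} .
\end{align*}
```

Chaining the two displays yields $\|u(k+1)-u(k)\|_H^2 \le \frac{1}{k+1}\|u(0)-u^*\|_H^2$, that is, the non-ergodic $O(1/k)$ rate of S3.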


The first three steps are similar to those discussed in [20], [35]. Proving the last step
is straightforward from the KKT conditions of the compact learning problem of Section 6.1.
Under S1-S4, the main result establishing convergence of the decentralized ADMM is as follows.


Theorem 1. If for the decentralized ADMM iterations the initial multiplier $v(0) := [\bar{v}(0)^\top \; \tilde{v}(0)^\top]^\top$
satisfies $\bar{v}(0) = -\tilde{v}(0)$, and $z(0)$ is such that $E_u s(0) = 2z(0)$, then with the ADMM
penalty parameter $c > 0$ it holds under Assumption 1 that the iterates $s(k)$ and $v(k)$
converge to a pair of optimal primal and dual solutions of the compact learning problem of Section 6.1.

Theorem 1 asserts that under proper initialization, convergence of the decentralized
ADMM only requires the local costs $f_i$ to be closed, proper, and convex.
However, it does not specify which pair of optimal primal and dual solutions the
iterates $(s(k), v(k))$ converge to. Indeed, $s(k)$ can converge to one of the optimal
primal solutions $s^*$, and $v(k)$ can converge to one of the corresponding optimal dual
solutions $v^*$. The limit $(s^*, v^*)$ is ultimately determined by the initial $s(0)$ and $v(0)$.
Moreover, the conditions in Theorem 1 also guarantee ergodic and non-ergodic $o(1/k)$
convergence rates in terms of objective error and successive iterate differences, as
proved in the recent paper [?].

6.3 Linear Rate of Convergence


Linear rate of convergence for the centralized ADMM is established in [?], and
for the decentralized ADMM in [73]. Similar to the convergence analysis of the last
section, the proof includes the following steps:

S1’. Show that $\|u(k) - u^*\|_H^2$ is contractive, namely, for all times $k \ge 0$ it holds that

$$ \|u(k+1) - u^*\|_H^2 \le \frac{1}{1+\delta}\|u(k) - u^*\|_H^2 \tag{38} $$

where $\delta > 0$ is a constant [cf. (40)]. Note that the contraction inequality (38)
implies Q-linear convergence of $\|u(k) - u^*\|_H^2$.

S2’. Show that $\|s(k+1) - s^*\|^2$ is R-linearly convergent since it is upper-bounded by
a Q-linear convergent sequence, meaning

$$ \|s(k+1) - s^*\|^2 \le \frac{1}{m_f}\|u(k) - u^*\|_H^2 \tag{39} $$

where $m_f$ is the strong convexity constant of the aggregate cost function $f$.

We now state the main result establishing linear rate of convergence for the
decentralized ADMM algorithm.

Theorem 2. If for the decentralized ADMM iterations the initial multiplier $v(0) := [\bar{v}(0)^\top \; \tilde{v}(0)^\top]^\top$
satisfies $\bar{v}(0) = -\tilde{v}(0)$; the initial auxiliary variable $z(0)$ is such that $E_u s(0) = 2z(0)$;
and the initial multiplier $\bar{v}(0)$ lies in the column space of $E_o$, then with the
ADMM parameter $c > 0$, it holds under Assumptions 1-3, that the iterates $s(k)$ and
$v(k)$ converge R-linearly to $(s^*, v^*)$, where $s^*$ is the unique optimal primal solution
and $v^*$ is the unique optimal dual solution lying in the column space of $E_o$.

Theorem 2 requires the local cost functions to be closed, proper, convex, strongly
convex, and have Lipschitz gradients. In addition to the initialization dictated by
Theorem 1, Theorem 2 further requires the initial multiplier $\bar{v}(0)$ to lie in the column
space of $E_o$, which guarantees that $v(k)$ converges to $v^*$, the unique optimal dual
solution lying in the column space of $E_o$. The primal iterate $s(k)$ converges to $s^*$,
which is unique since the original cost function is strongly convex.
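
To illustrate the behavior promised by Theorems 1 and 2, the sketch below runs a per-node consensus ADMM recursion on a toy averaging problem with local costs $f_i(s) = \frac{1}{2}(s - y_i)^2$, which are strongly convex with Lipschitz gradients. The recursion used here is the commonly cited form in which each node keeps one estimate $s_i$ and one aggregate multiplier $v_i$; it is an assumption of this sketch, since the chapter's own iterations are defined outside this section. The data `y`, the five-node ring, and the choice $c = 1$ are hypothetical.

```python
def consensus_admm_average(y, neighbors, c=1.0, iters=300):
    """Decentralized averaging with f_i(s) = 0.5 * (s - y[i])**2.

    Assumed per-node recursion (auxiliary variables eliminated):
      s_i(k+1) = argmin_s f_i(s) + v_i(k)*s + c * sum_j (s - (s_i(k) + s_j(k))/2)**2
      v_i(k+1) = v_i(k) + c * sum_j (s_i(k+1) - s_j(k+1))
    where j ranges over the neighbors of node i.
    """
    n = len(y)
    s = [0.0] * n  # local estimates, initialized at zero
    v = [0.0] * n  # local multipliers, initialized at zero
    target = sum(y) / n
    errors = []
    for _ in range(iters):
        s_new = []
        for i in range(n):
            d_i = len(neighbors[i])
            nbr_sum = sum(s[j] for j in neighbors[i])
            # Closed-form minimizer of the local quadratic subproblem.
            s_new.append((y[i] - v[i] + c * (d_i * s[i] + nbr_sum))
                         / (1.0 + 2.0 * c * d_i))
        for i in range(n):
            # Dual ascent on the disagreement with neighbors.
            v[i] += c * sum(s_new[i] - s_new[j] for j in neighbors[i])
        s = s_new
        errors.append(max(abs(x - target) for x in s))
    return s, errors

# Hypothetical data on a five-node ring; the network-wide average is 3.0.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
estimates, errors = consensus_admm_average(y, ring)
```

Each node exchanges only its current estimate with its two ring neighbors, yet all local estimates reach the global average; printing `errors` shows how the worst-case deviation shrinks over the iterations.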

Observe from the contraction inequality (38) that the speed of convergence is
determined by the contraction parameter $\delta$: A larger $\delta$ means stronger contraction
and hence faster convergence. Indeed, [?] gives an explicit expression of $\delta$, that is

$$ \delta = \min\left\{ \frac{(\mu - 1)\gamma_o}{\mu \Gamma_u}, \; \frac{2 c\, m_f \gamma_o}{c^2 \Gamma_u \gamma_o + \mu M_f^2} \right\} \tag{40} $$

where $m_f$ is the strong convexity constant of $f$, $M_f$ is the Lipschitz continuity constant
of $\nabla f$, $\gamma_o$ is the smallest nonzero eigenvalue of the oriented Laplacian $\mathbf{L}_o$, $\Gamma_u$
is the largest eigenvalue of the unoriented Laplacian $\mathbf{L}_u$, $c$ is the ADMM penalty
parameter, and $\mu > 1$ is an arbitrary constant.

As the current form of (40) does not offer insights on how the properties of the
cost functions, the underlying network, and the ADMM parameter influence the
speed of convergence, [42] finds the largest value of $\delta$ by tuning the constant $\mu$
and the ADMM parameter $c$. Specifically, [42] shows that

$$ c = M_f \sqrt{\frac{\mu}{\Gamma_u \gamma_o}} \quad \text{and} \quad \sqrt{\frac{1}{\mu}} = \sqrt{\frac{1}{4}\frac{m_f^2}{M_f^2}\frac{\Gamma_u}{\gamma_o} + 1} - \frac{1}{2}\frac{m_f}{M_f}\sqrt{\frac{\Gamma_u}{\gamma_o}} $$

maximizes the right-hand side of (40), so that

$$ \delta = \frac{m_f}{M_f}\left[ \sqrt{\frac{1}{4}\frac{m_f^2}{M_f^2} + \frac{\gamma_o}{\Gamma_u}} - \frac{1}{2}\frac{m_f}{M_f} \right]. \tag{41} $$

The best contraction parameter $\delta$ is a function of the condition number $M_f/m_f$
of the aggregate cost function $f$, and the condition number of the graph $\Gamma_u/\gamma_o$. Note
that we always have $\delta < 1$, while small values of $\delta$ result when $M_f/m_f \gg 1$ or
when $\Gamma_u/\gamma_o \gg 1$; that is, when either the cost function or the graph is ill conditioned.
When the condition numbers are such that $\Gamma_u/\gamma_o \gg M_f^2/m_f^2$, the condition number
of the graph dominates, and we obtain $\delta \approx \gamma_o/\Gamma_u$, implying that the contraction
is determined by the condition number of the graph. When $M_f^2/m_f^2 \gg \Gamma_u/\gamma_o$, the
condition number of the cost dominates and we have $\delta \approx (m_f/M_f)\sqrt{\gamma_o/\Gamma_u}$. In the
latter case the contraction is constrained by both the condition number of the cost
function, and the condition number of the graph.
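
The interplay between the two condition numbers can be explored numerically. The sketch below evaluates the contraction parameter in two ways: from the two-term minimum in (40) with $c$ and $\mu$ tuned as stated above, and from the closed form (41). The formulas follow our reading of (40) and (41), namely $\delta = \min\{(\mu-1)\gamma_o/(\mu\Gamma_u),\; 2cm_f\gamma_o/(c^2\Gamma_u\gamma_o + \mu M_f^2)\}$ and $\delta = (m_f/M_f)[\sqrt{m_f^2/(4M_f^2) + \gamma_o/\Gamma_u} - m_f/(2M_f)]$; the function names and the numerical values are hypothetical.

```python
import math

def delta_from_40(m_f, M_f, gamma_o, Gamma_u, c, mu):
    """Contraction parameter as the minimum of the two terms in (40)."""
    first = (mu - 1.0) * gamma_o / (mu * Gamma_u)
    second = 2.0 * c * m_f * gamma_o / (c ** 2 * Gamma_u * gamma_o + mu * M_f ** 2)
    return min(first, second)

def tuned_parameters(m_f, M_f, gamma_o, Gamma_u):
    """Penalty parameter c and constant mu that balance the two terms of (40)."""
    ratio = m_f / M_f
    kappa_g = Gamma_u / gamma_o  # condition number of the graph
    inv_sqrt_mu = (math.sqrt(0.25 * ratio ** 2 * kappa_g + 1.0)
                   - 0.5 * ratio * math.sqrt(kappa_g))
    mu = 1.0 / inv_sqrt_mu ** 2
    c = M_f * math.sqrt(mu / (Gamma_u * gamma_o))
    return c, mu

def delta_from_41(m_f, M_f, gamma_o, Gamma_u):
    """Closed-form best contraction parameter (41)."""
    ratio = m_f / M_f
    return ratio * (math.sqrt(0.25 * ratio ** 2 + gamma_o / Gamma_u) - 0.5 * ratio)

# Hypothetical constants: cost condition number 2, graph condition number 4.
m_f, M_f, gamma_o, Gamma_u = 1.0, 2.0, 1.0, 4.0
c_opt, mu_opt = tuned_parameters(m_f, M_f, gamma_o, Gamma_u)
delta_40 = delta_from_40(m_f, M_f, gamma_o, Gamma_u, c_opt, mu_opt)
delta_41 = delta_from_41(m_f, M_f, gamma_o, Gamma_u)

# A mistuned penalty parameter cannot improve on the tuned one for the same mu.
delta_mistuned = delta_from_40(m_f, M_f, gamma_o, Gamma_u, 10.0 * c_opt, mu_opt)
```

With the tuned pair both terms of the minimum coincide, so `delta_40` and `delta_41` agree up to rounding, while `delta_mistuned` is smaller; rerunning with a larger `Gamma_u / gamma_o` or `M_f / m_f` shows the contraction weakening as either the graph or the cost becomes ill conditioned.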